* [PATCH] testpmd: add nanosleep in main loop
@ 2017-11-10  6:02 Marcelo Tosatti
  2017-11-10  9:12 ` Adrien Mazarguil
  0 siblings, 1 reply; 14+ messages in thread
From: Marcelo Tosatti @ 2017-11-10  6:02 UTC (permalink / raw)
  To: dev; +Cc: Luiz Capitulino, Daniel Bristot de Oliveira


This patch allows a configurable pair of values to be set, which
control the frequency and length of a nanosleep call performed in
test-pmd's iofwd main loop.

The problem is the following: it is necessary to execute, on the
isolated CPUs, code which is not part of the packet forwarding load.

For example:

 "echo val > /sys/kernel/debug/tracing/buffer_size_kb"

hangs the process, because the DPDK thread has higher priority than
the workqueue thread which executes the flush from the CPU-local
trace buffer to the CPU-global trace buffer [the workitem in
question].

There are more serious issues than the trace-cmd bug, such as XFS
workitems failing to execute, causing filesystem corruption.

To work around this problem, until a proper kernel solution is
developed, allow DPDK to nanosleep (hopefully at a low enough
frequency and for a short enough interval that performance stays
within acceptable levels).

The new parameters are:

*  --delay-hz: sets nanosleep frequency in Hz.
*  --delay-length: sets nanosleep length in ns.
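
For example, an illustrative invocation (EAL options depend on the
setup and are omitted here):

 "testpmd <EAL options> -- --delay-hz=100 --delay-length=10000"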

Results for delay-hz=100,delay-length=10000 (which allows 
the buffer_size_kb change to complete):

Baseline run-1:
[Histogram port 0 to port 1 at rate 2.3 Mpps] Samples: 49505, Average:
19008.7 ns, StdDev: 2501.0 ns, Quartiles: 17293.0/18330.0/19901.0 ns

Baseline run-2:
[Histogram port 0 to port 1 at rate 2.3 Mpps] Samples: 49606, Average:
19036.4 ns, StdDev: 2485.2 ns, Quartiles: 17318.0/18349.0/19936.0 ns

Baseline run-3:
[Histogram port 0 to port 1 at rate 2.3 Mpps] Samples: 49627, Average:
19019.2 ns, StdDev: 2503.7 ns, Quartiles: 17323.0/18355.0/19940.0 ns

============================

(10.000us, 100HZ)

Run-1:
[Histogram port 0 to port 1 at rate 2.3 Mpps] Samples: 7284, Average:
20830.6 ns, StdDev: 12023.0 ns, Quartiles: 17309.0/18394.0/20233.0 ns

Run-2:
[Histogram port 0 to port 1 at rate 2.3 Mpps] Samples: 6272, Average:
20897.1 ns, StdDev: 12057.2 ns, Quartiles: 17389.0/18457.0/20266.0 ns

Run-3:
[Histogram port 0 to port 1 at rate 2.3 Mpps] Samples: 4843, Average:
20535.2 ns, StdDev: 9827.3 ns, Quartiles: 17389.0/18441.0/20269.0 ns
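
(Relative to baseline, the average thus rises from roughly 19.0 us to
roughly 20.8 us, about 9%, while the standard deviation grows from
about 2.5 us to 10-12 us.)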


Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>


diff -Nur dpdk-17.08.orig/app/test-pmd/iofwd.c dpdk-17.08/app/test-pmd/iofwd.c
--- dpdk-17.08.orig/app/test-pmd/iofwd.c	2017-10-30 22:45:37.829492673 -0200
+++ dpdk-17.08/app/test-pmd/iofwd.c	2017-10-30 22:45:48.321522581 -0200
@@ -64,9 +64,30 @@
 #include <rte_ethdev.h>
 #include <rte_string_fns.h>
 #include <rte_flow.h>
+#include <time.h>
 
 #include "testpmd.h"
 
+uint32_t nanosleep_interval;
+
+static void calc_nanosleep_interval(int hz)
+{
+	uint64_t cycles_per_sec = rte_get_timer_hz();
+	nanosleep_interval = cycles_per_sec/hz;
+}
+
+static void do_nanosleep(void)
+{
+	struct timespec req;
+
+	req.tv_sec = 0;
+	req.tv_nsec = nanosleep_length;
+
+	nanosleep(&req, NULL);
+
+	return;
+}
+
 /*
  * Forwarding of packets in I/O mode.
  * Forward packets "as-is".
@@ -81,6 +102,10 @@
 	uint16_t nb_tx;
 	uint32_t retry;
 
+
+	if (nanosleep_interval == 0 && nanosleep_frequency > 0)
+		calc_nanosleep_interval(nanosleep_frequency);
+
 #ifdef RTE_TEST_PMD_RECORD_CORE_CYCLES
 	uint64_t start_tsc;
 	uint64_t end_tsc;
@@ -91,6 +116,12 @@
 	start_tsc = rte_rdtsc();
 #endif
 
+	if (nanosleep_frequency > 0 &&
+	    rte_get_timer_cycles() > fs->next_nanosleep) {
+		do_nanosleep();
+		fs->next_nanosleep = rte_get_timer_cycles() + nanosleep_interval;
+	}
+
 	/*
 	 * Receive a burst of packets and forward them.
 	 */
diff -Nur dpdk-17.08.orig/app/test-pmd/parameters.c dpdk-17.08/app/test-pmd/parameters.c
--- dpdk-17.08.orig/app/test-pmd/parameters.c	2017-10-30 22:45:37.830492676 -0200
+++ dpdk-17.08/app/test-pmd/parameters.c	2017-10-30 22:46:33.708651912 -0200
@@ -216,6 +216,8 @@
 	       "disable print of designated event or all of them.\n");
 	printf("  --flow-isolate-all: "
 	       "requests flow API isolated mode on all ports at initialization time.\n");
+	printf("  --delay-hz: sets nanosleep frequency in Hz.\n"); 
+	printf("  --delay-length: sets nanosleep length in ns.\n"); 
 }
 
 #ifdef RTE_LIBRTE_CMDLINE
@@ -638,7 +640,9 @@
 		{ "no-rmv-interrupt",		0, 0, 0 },
 		{ "print-event",		1, 0, 0 },
 		{ "mask-event",			1, 0, 0 },
-		{ 0, 0, 0, 0 },
+		{ "delay-hz",			1, 0, 0 },
+		{ "delay-length",		1, 0, 0 },
+	 	{ 0, 0, 0, 0 },
 	};
 
 	argvopt = argv;
@@ -1099,6 +1103,27 @@
 				else
 					rte_exit(EXIT_FAILURE, "bad txpkts\n");
 			}
+
+			if (!strcmp(lgopts[opt_idx].name, "delay-hz")) {
+				int n;
+
+				n = atoi(optarg);
+
+				if (n < 0)
+					rte_exit(EXIT_FAILURE, "bad delay-hz\n");
+				nanosleep_frequency = n;
+			}
+
+			if (!strcmp(lgopts[opt_idx].name, "delay-length")) {
+				int n;
+
+				n = atoi(optarg);
+
+				if (n < 0)
+					rte_exit(EXIT_FAILURE, "bad delay-length\n");
+				nanosleep_length = n;
+			}
+
 			if (!strcmp(lgopts[opt_idx].name, "no-flush-rx"))
 				no_flush_rx = 1;
 			if (!strcmp(lgopts[opt_idx].name, "disable-link-check"))
diff -Nur dpdk-17.08.orig/app/test-pmd/testpmd.c dpdk-17.08/app/test-pmd/testpmd.c
--- dpdk-17.08.orig/app/test-pmd/testpmd.c	2017-10-30 22:45:37.829492673 -0200
+++ dpdk-17.08/app/test-pmd/testpmd.c	2017-10-30 22:45:48.323522591 -0200
@@ -327,6 +327,13 @@
 
 #endif
 
+
+/* How long to sleep in packet processing */
+uint32_t nanosleep_length;
+
+/* How often to sleep in packet processing */
+uint32_t nanosleep_frequency;
+
 /*
  * Ethernet device configuration.
  */
diff -Nur dpdk-17.08.orig/app/test-pmd/testpmd.h dpdk-17.08/app/test-pmd/testpmd.h
--- dpdk-17.08.orig/app/test-pmd/testpmd.h	2017-10-30 22:45:37.829492673 -0200
+++ dpdk-17.08/app/test-pmd/testpmd.h	2017-10-30 22:45:48.323522591 -0200
@@ -127,6 +127,7 @@
 	struct pkt_burst_stats rx_burst_stats;
 	struct pkt_burst_stats tx_burst_stats;
 #endif
+	uint64_t next_nanosleep;
 };
 
 /** Offload IP checksum in csum forward engine */
@@ -390,6 +391,9 @@
 extern lcoreid_t latencystats_lcore_id;
 #endif
 
+extern uint32_t nanosleep_length;
+extern uint32_t nanosleep_frequency;
+
 #ifdef RTE_LIBRTE_BITRATE
 extern lcoreid_t bitrate_lcore_id;
 extern uint8_t bitrate_enabled;


* Re: [PATCH] testpmd: add nanosleep in main loop
  2017-11-10  6:02 [PATCH] testpmd: add nanosleep in main loop Marcelo Tosatti
@ 2017-11-10  9:12 ` Adrien Mazarguil
  2017-11-10 10:13   ` Daniel Bristot de Oliveira
                     ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Adrien Mazarguil @ 2017-11-10  9:12 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: dev, Luiz Capitulino, Daniel Bristot de Oliveira

Hi Marcelo,

On Fri, Nov 10, 2017 at 04:02:10AM -0200, Marcelo Tosatti wrote:
> 
> This patch allows a configurable pair of values to be set, which
> control the frequency and length of a nanosleep call performed in
> test-pmd's iofwd main loop.
> 
> The problem is the following: it is necessary to execute, on the
> isolated CPUs, code which is not part of the packet forwarding load.
> 
> For example:
> 
>  "echo val > /sys/kernel/debug/tracing/buffer_size_kb"
> 
> hangs the process, because the DPDK thread has higher priority than
> the workqueue thread which executes the flush from the CPU-local
> trace buffer to the CPU-global trace buffer [the workitem in
> question].
> 
> There are more serious issues than the trace-cmd bug, such as XFS
> workitems failing to execute, causing filesystem corruption.
> 
> To work around this problem, until a proper kernel solution is
> developed, allow DPDK to nanosleep (hopefully at a low enough
> frequency and for a short enough interval that performance stays
> within acceptable levels).

I understand the need to do something about it, however the nanosleep()
approach seems questionable to me.

Testpmd's forwarding modes (particularly I/O) are used for benchmarking
purposes by many and are therefore sensitive to change. This code path is
currently free from system calls for that reason and nanosleep() is an
expensive one by definition. Even if optional or called at a low frequency,
the presence of this new code has an impact.

Since testpmd is a development tool not supposed to run in a production
environment, is there really a need for it to be patched to work around a
(temporary) Linux kernel bug?

If so, why is I/O the only forwarding mode impacted?

If it's used in a production environment and such a fix can't wait, have
other workarounds been considered:

- Replacing testpmd in I/O mode with a physical cable or switch?

- Using proper options on the kernel command line as described in [1], such
  as isolcpus, rcu_nocbs, nohz_full?

[1] doc/guides/howto/pvp_reference_benchmark.rst

> 
> The new parameters are:
> 
> *  --delay-hz: sets nanosleep frequency in Hz.
> *  --delay-length: sets nanosleep length in ns.
> 
> Results for delay-hz=100,delay-length=10000 (which allows 
> the buffer_size_kb change to complete):
> 
> Baseline run-1:
> [Histogram port 0 to port 1 at rate 2.3 Mpps] Samples: 49505, Average:
> 19008.7 ns, StdDev: 2501.0 ns, Quartiles: 17293.0/18330.0/19901.0 ns
> 
> Baseline run-2:
> [Histogram port 0 to port 1 at rate 2.3 Mpps] Samples: 49606, Average:
> 19036.4 ns, StdDev: 2485.2 ns, Quartiles: 17318.0/18349.0/19936.0 ns
> 
> Baseline run-3:
> [Histogram port 0 to port 1 at rate 2.3 Mpps] Samples: 49627, Average:
> 19019.2 ns, StdDev: 2503.7 ns, Quartiles: 17323.0/18355.0/19940.0 ns
> 
> ============================
> 
> (10.000us, 100HZ)
> 
> Run-1:
> [Histogram port 0 to port 1 at rate 2.3 Mpps] Samples: 7284, Average:
> 20830.6 ns, StdDev: 12023.0 ns, Quartiles: 17309.0/18394.0/20233.0 ns
> 
> Run-2:
> [Histogram port 0 to port 1 at rate 2.3 Mpps] Samples: 6272, Average:
> 20897.1 ns, StdDev: 12057.2 ns, Quartiles: 17389.0/18457.0/20266.0 ns
> 
> Run-3:
> [Histogram port 0 to port 1 at rate 2.3 Mpps] Samples: 4843, Average:
> 20535.2 ns, StdDev: 9827.3 ns, Quartiles: 17389.0/18441.0/20269.0 ns
> 
> 
> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> 
> 
> diff -Nur dpdk-17.08.orig/app/test-pmd/iofwd.c dpdk-17.08/app/test-pmd/iofwd.c
> --- dpdk-17.08.orig/app/test-pmd/iofwd.c	2017-10-30 22:45:37.829492673 -0200
> +++ dpdk-17.08/app/test-pmd/iofwd.c	2017-10-30 22:45:48.321522581 -0200
> @@ -64,9 +64,30 @@
>  #include <rte_ethdev.h>
>  #include <rte_string_fns.h>
>  #include <rte_flow.h>
> +#include <time.h>
>  
>  #include "testpmd.h"
>  
> +uint32_t nanosleep_interval;
> +
> +static void calc_nanosleep_interval(int hz)
> +{
> +	uint64_t cycles_per_sec = rte_get_timer_hz();
> +	nanosleep_interval = cycles_per_sec/hz;
> +}
> +
> +static void do_nanosleep(void)
> +{
> +	struct timespec req;
> +
> +	req.tv_sec = 0;
> +	req.tv_nsec = nanosleep_length;
> +
> +	nanosleep(&req, NULL);
> +
> +	return;
> +}
> +
>  /*
>   * Forwarding of packets in I/O mode.
>   * Forward packets "as-is".
> @@ -81,6 +102,10 @@
>  	uint16_t nb_tx;
>  	uint32_t retry;
>  
> +
> +	if (nanosleep_interval == 0 && nanosleep_frequency > 0)
> +		calc_nanosleep_interval(nanosleep_frequency);
> +
>  #ifdef RTE_TEST_PMD_RECORD_CORE_CYCLES
>  	uint64_t start_tsc;
>  	uint64_t end_tsc;
> @@ -91,6 +116,12 @@
>  	start_tsc = rte_rdtsc();
>  #endif
>  
> +	if (nanosleep_frequency > 0 &&
> +	    rte_get_timer_cycles() > fs->next_nanosleep) {
> +		do_nanosleep();
> +		fs->next_nanosleep = rte_get_timer_cycles() + nanosleep_interval;
> +	}
> +
>  	/*
>  	 * Receive a burst of packets and forward them.
>  	 */
> diff -Nur dpdk-17.08.orig/app/test-pmd/parameters.c dpdk-17.08/app/test-pmd/parameters.c
> --- dpdk-17.08.orig/app/test-pmd/parameters.c	2017-10-30 22:45:37.830492676 -0200
> +++ dpdk-17.08/app/test-pmd/parameters.c	2017-10-30 22:46:33.708651912 -0200
> @@ -216,6 +216,8 @@
>  	       "disable print of designated event or all of them.\n");
>  	printf("  --flow-isolate-all: "
>  	       "requests flow API isolated mode on all ports at initialization time.\n");
> +	printf("  --delay-hz: sets nanosleep frequency in Hz.\n"); 
> +	printf("  --delay-length: sets nanosleep length in ns.\n"); 
>  }
>  
>  #ifdef RTE_LIBRTE_CMDLINE
> @@ -638,7 +640,9 @@
>  		{ "no-rmv-interrupt",		0, 0, 0 },
>  		{ "print-event",		1, 0, 0 },
>  		{ "mask-event",			1, 0, 0 },
> -		{ 0, 0, 0, 0 },
> +		{ "delay-hz",			1, 0, 0 },
> +		{ "delay-length",		1, 0, 0 },
> +	 	{ 0, 0, 0, 0 },
>  	};
>  
>  	argvopt = argv;
> @@ -1099,6 +1103,27 @@
>  				else
>  					rte_exit(EXIT_FAILURE, "bad txpkts\n");
>  			}
> +
> +			if (!strcmp(lgopts[opt_idx].name, "delay-hz")) {
> +				int n;
> +
> +				n = atoi(optarg);
> +
> +				if (n < 0)
> +					rte_exit(EXIT_FAILURE, "bad delay-hz\n");
> +				nanosleep_frequency = n;
> +			}
> +
> +			if (!strcmp(lgopts[opt_idx].name, "delay-length")) {
> +				int n;
> +
> +				n = atoi(optarg);
> +
> +				if (n < 0)
> +					rte_exit(EXIT_FAILURE, "bad delay-length\n");
> +				nanosleep_length = n;
> +			}
> +
>  			if (!strcmp(lgopts[opt_idx].name, "no-flush-rx"))
>  				no_flush_rx = 1;
>  			if (!strcmp(lgopts[opt_idx].name, "disable-link-check"))
> diff -Nur dpdk-17.08.orig/app/test-pmd/testpmd.c dpdk-17.08/app/test-pmd/testpmd.c
> --- dpdk-17.08.orig/app/test-pmd/testpmd.c	2017-10-30 22:45:37.829492673 -0200
> +++ dpdk-17.08/app/test-pmd/testpmd.c	2017-10-30 22:45:48.323522591 -0200
> @@ -327,6 +327,13 @@
>  
>  #endif
>  
> +
> +/* How long to sleep in packet processing */
> +uint32_t nanosleep_length;
> +
> +/* How often to sleep in packet processing */
> +uint32_t nanosleep_frequency;
> +
>  /*
>   * Ethernet device configuration.
>   */
> diff -Nur dpdk-17.08.orig/app/test-pmd/testpmd.h dpdk-17.08/app/test-pmd/testpmd.h
> --- dpdk-17.08.orig/app/test-pmd/testpmd.h	2017-10-30 22:45:37.829492673 -0200
> +++ dpdk-17.08/app/test-pmd/testpmd.h	2017-10-30 22:45:48.323522591 -0200
> @@ -127,6 +127,7 @@
>  	struct pkt_burst_stats rx_burst_stats;
>  	struct pkt_burst_stats tx_burst_stats;
>  #endif
> +	uint64_t next_nanosleep;
>  };
>  
>  /** Offload IP checksum in csum forward engine */
> @@ -390,6 +391,9 @@
>  extern lcoreid_t latencystats_lcore_id;
>  #endif
>  
> +extern uint32_t nanosleep_length;
> +extern uint32_t nanosleep_frequency;
> +
>  #ifdef RTE_LIBRTE_BITRATE
>  extern lcoreid_t bitrate_lcore_id;
>  extern uint8_t bitrate_enabled;
> 
> 
> 
> 
> 
> 
> 
> 
> 

-- 
Adrien Mazarguil
6WIND


* Re: [PATCH] testpmd: add nanosleep in main loop
  2017-11-10  9:12 ` Adrien Mazarguil
@ 2017-11-10 10:13   ` Daniel Bristot de Oliveira
  2017-11-10 10:14   ` Ananyev, Konstantin
  2017-11-11  3:45   ` Marcelo Tosatti
  2 siblings, 0 replies; 14+ messages in thread
From: Daniel Bristot de Oliveira @ 2017-11-10 10:13 UTC (permalink / raw)
  To: Adrien Mazarguil, Marcelo Tosatti; +Cc: dev, Luiz Capitulino

On 11/10/2017 10:12 AM, Adrien Mazarguil wrote:
> Since testpmd is a development tool not supposed to run in a production
> environment, is there really a need for it to be patched to work around a
> (temporary) Linux kernel bug?

From the kernel side... not even...

> If so, why is I/O the only forwarding mode impacted?
> 
> If it's used in a production environment and such a fix can't wait, have
> other workarounds been considered:
> 
> - Replacing testpmd in I/O mode with a physical cable or switch?

using proper options like:

> - Using proper options on the kernel command line as described in [1], such
>   as isolcpus, rcu_nocbs, nohz_full?

guarantees you that a CPU is completely isolated. In the current state
of the art, it is not possible to assume that a CPU can be fully
isolated from OS housekeeping threads.

For example, some kernel sub-systems rely on executing code on every
CPU, e.g., using kworkers, and they are not only tracing or debugging
options. The case Marcelo showed is just one straightforward use case.
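
(The per-CPU kworkers are easy to observe; something like
"ps -eo psr,comm | grep kworker" will list one or more bound to each
CPU.)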

If a busy-looping isolated task runs with RT priority, it will end up
delaying such workers, making the system complain about hung tasks.

-- Daniel


* Re: [PATCH] testpmd: add nanosleep in main loop
  2017-11-10  9:12 ` Adrien Mazarguil
  2017-11-10 10:13   ` Daniel Bristot de Oliveira
@ 2017-11-10 10:14   ` Ananyev, Konstantin
  2017-11-10 10:42     ` Daniel Bristot de Oliveira
  2017-11-11  3:49     ` Marcelo Tosatti
  2017-11-11  3:45   ` Marcelo Tosatti
  2 siblings, 2 replies; 14+ messages in thread
From: Ananyev, Konstantin @ 2017-11-10 10:14 UTC (permalink / raw)
  To: Adrien Mazarguil, Marcelo Tosatti
  Cc: dev, Luiz Capitulino, Daniel Bristot de Oliveira



> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Adrien Mazarguil
> Sent: Friday, November 10, 2017 9:12 AM
> To: Marcelo Tosatti <mtosatti@redhat.com>
> Cc: dev@dpdk.org; Luiz Capitulino <lcapitulino@redhat.com>; Daniel Bristot de Oliveira <bristot@redhat.com>
> Subject: Re: [dpdk-dev] [PATCH] testpmd: add nanosleep in main loop
> 
> Hi Marcelo,
> 
> On Fri, Nov 10, 2017 at 04:02:10AM -0200, Marcelo Tosatti wrote:
> >
> > This patch allows a configurable pair of values to be set, which
> > control the frequency and length of a nanosleep call performed in
> > test-pmd's iofwd main loop.
> >
> > The problem is the following: it is necessary to execute, on the
> > isolated CPUs, code which is not part of the packet forwarding load.
> >
> > For example:
> >
> >  "echo val > /sys/kernel/debug/tracing/buffer_size_kb"
> >
> > hangs the process, because the DPDK thread has higher priority than
> > the workqueue thread which executes the flush from the CPU-local
> > trace buffer to the CPU-global trace buffer [the workitem in
> > question].
> >
> > There are more serious issues than the trace-cmd bug, such as XFS
> > workitems failing to execute, causing filesystem corruption.
> >
> > To work around this problem, until a proper kernel solution is
> > developed, allow DPDK to nanosleep (hopefully at a low enough
> > frequency and for a short enough interval that performance stays
> > within acceptable levels).
> 
> I understand the need to do something about it, however the nanosleep()
> approach seems questionable to me.
> 
> Testpmd's forwarding modes (particularly I/O) are used for benchmarking
> purposes by many and are therefore sensitive to change. This code path is
> currently free from system calls for that reason and nanosleep() is an
> expensive one by definition. Even if optional or called at a low frequency,
> the presence of this new code has an impact.
> 
> Since testpmd is a development tool not supposed to run in a production
> environment, is there really a need for it to be patched to work around a
> (temporary) Linux kernel bug?
> 
> If so, why is I/O the only forwarding mode impacted?
> 
> If it's used in a production environment and such a fix can't wait, have
> other workarounds been considered:
> 
> - Replacing testpmd in I/O mode with a physical cable or switch?
> 
> - Using proper options on the kernel command line as described in [1], such
>   as isolcpus, rcu_nocbs, nohz_full?
> 
> [1] doc/guides/howto/pvp_reference_benchmark.rst


Agree with Adrien here - the patch doesn't fix the problem in any case,
while introducing an unnecessary slowdown in testpmd iofwd mode.
Please think up some other approach.
Konstantin

> 
> >
> > The new parameters are:
> >
> > *  --delay-hz: sets nanosleep frequency in Hz.
> > *  --delay-length: sets nanosleep length in ns.
> >
> > Results for delay-hz=100,delay-length=10000 (which allows
> > the buffer_size_kb change to complete):
> >
> > Baseline run-1:
> > [Histogram port 0 to port 1 at rate 2.3 Mpps] Samples: 49505, Average:
> > 19008.7 ns, StdDev: 2501.0 ns, Quartiles: 17293.0/18330.0/19901.0 ns
> >
> > Baseline run-2:
> > [Histogram port 0 to port 1 at rate 2.3 Mpps] Samples: 49606, Average:
> > 19036.4 ns, StdDev: 2485.2 ns, Quartiles: 17318.0/18349.0/19936.0 ns
> >
> > Baseline run-3:
> > [Histogram port 0 to port 1 at rate 2.3 Mpps] Samples: 49627, Average:
> > 19019.2 ns, StdDev: 2503.7 ns, Quartiles: 17323.0/18355.0/19940.0 ns
> >
> > ============================
> >
> > (10.000us, 100HZ)
> >
> > Run-1:
> > [Histogram port 0 to port 1 at rate 2.3 Mpps] Samples: 7284, Average:
> > 20830.6 ns, StdDev: 12023.0 ns, Quartiles: 17309.0/18394.0/20233.0 ns
> >
> > Run-2:
> > [Histogram port 0 to port 1 at rate 2.3 Mpps] Samples: 6272, Average:
> > 20897.1 ns, StdDev: 12057.2 ns, Quartiles: 17389.0/18457.0/20266.0 ns
> >
> > Run-3:
> > [Histogram port 0 to port 1 at rate 2.3 Mpps] Samples: 4843, Average:
> > 20535.2 ns, StdDev: 9827.3 ns, Quartiles: 17389.0/18441.0/20269.0 ns
> >
> >
> > Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> >
> >
> > diff -Nur dpdk-17.08.orig/app/test-pmd/iofwd.c dpdk-17.08/app/test-pmd/iofwd.c
> > --- dpdk-17.08.orig/app/test-pmd/iofwd.c	2017-10-30 22:45:37.829492673 -0200
> > +++ dpdk-17.08/app/test-pmd/iofwd.c	2017-10-30 22:45:48.321522581 -0200
> > @@ -64,9 +64,30 @@
> >  #include <rte_ethdev.h>
> >  #include <rte_string_fns.h>
> >  #include <rte_flow.h>
> > +#include <time.h>
> >
> >  #include "testpmd.h"
> >
> > +uint32_t nanosleep_interval;
> > +
> > +static void calc_nanosleep_interval(int hz)
> > +{
> > +	uint64_t cycles_per_sec = rte_get_timer_hz();
> > +	nanosleep_interval = cycles_per_sec/hz;
> > +}
> > +
> > +static void do_nanosleep(void)
> > +{
> > +	struct timespec req;
> > +
> > +	req.tv_sec = 0;
> > +	req.tv_nsec = nanosleep_length;
> > +
> > +	nanosleep(&req, NULL);
> > +
> > +	return;
> > +}
> > +
> >  /*
> >   * Forwarding of packets in I/O mode.
> >   * Forward packets "as-is".
> > @@ -81,6 +102,10 @@
> >  	uint16_t nb_tx;
> >  	uint32_t retry;
> >
> > +
> > +	if (nanosleep_interval == 0 && nanosleep_frequency > 0)
> > +		calc_nanosleep_interval(nanosleep_frequency);
> > +
> >  #ifdef RTE_TEST_PMD_RECORD_CORE_CYCLES
> >  	uint64_t start_tsc;
> >  	uint64_t end_tsc;
> > @@ -91,6 +116,12 @@
> >  	start_tsc = rte_rdtsc();
> >  #endif
> >
> > +	if (nanosleep_frequency > 0 &&
> > +	    rte_get_timer_cycles() > fs->next_nanosleep) {
> > +		do_nanosleep();
> > +		fs->next_nanosleep = rte_get_timer_cycles() + nanosleep_interval;
> > +	}
> > +
> >  	/*
> >  	 * Receive a burst of packets and forward them.
> >  	 */
> > diff -Nur dpdk-17.08.orig/app/test-pmd/parameters.c dpdk-17.08/app/test-pmd/parameters.c
> > --- dpdk-17.08.orig/app/test-pmd/parameters.c	2017-10-30 22:45:37.830492676 -0200
> > +++ dpdk-17.08/app/test-pmd/parameters.c	2017-10-30 22:46:33.708651912 -0200
> > @@ -216,6 +216,8 @@
> >  	       "disable print of designated event or all of them.\n");
> >  	printf("  --flow-isolate-all: "
> >  	       "requests flow API isolated mode on all ports at initialization time.\n");
> > +	printf("  --delay-hz: sets nanosleep frequency in Hz.\n");
> > +	printf("  --delay-length: sets nanosleep length in ns.\n");
> >  }
> >
> >  #ifdef RTE_LIBRTE_CMDLINE
> > @@ -638,7 +640,9 @@
> >  		{ "no-rmv-interrupt",		0, 0, 0 },
> >  		{ "print-event",		1, 0, 0 },
> >  		{ "mask-event",			1, 0, 0 },
> > -		{ 0, 0, 0, 0 },
> > +		{ "delay-hz",			1, 0, 0 },
> > +		{ "delay-length",		1, 0, 0 },
> > +	 	{ 0, 0, 0, 0 },
> >  	};
> >
> >  	argvopt = argv;
> > @@ -1099,6 +1103,27 @@
> >  				else
> >  					rte_exit(EXIT_FAILURE, "bad txpkts\n");
> >  			}
> > +
> > +			if (!strcmp(lgopts[opt_idx].name, "delay-hz")) {
> > +				int n;
> > +
> > +				n = atoi(optarg);
> > +
> > +				if (n < 0)
> > +					rte_exit(EXIT_FAILURE, "bad delay-hz\n");
> > +				nanosleep_frequency = n;
> > +			}
> > +
> > +			if (!strcmp(lgopts[opt_idx].name, "delay-length")) {
> > +				int n;
> > +
> > +				n = atoi(optarg);
> > +
> > +				if (n < 0)
> > +					rte_exit(EXIT_FAILURE, "bad delay-length\n");
> > +				nanosleep_length = n;
> > +			}
> > +
> >  			if (!strcmp(lgopts[opt_idx].name, "no-flush-rx"))
> >  				no_flush_rx = 1;
> >  			if (!strcmp(lgopts[opt_idx].name, "disable-link-check"))
> > diff -Nur dpdk-17.08.orig/app/test-pmd/testpmd.c dpdk-17.08/app/test-pmd/testpmd.c
> > --- dpdk-17.08.orig/app/test-pmd/testpmd.c	2017-10-30 22:45:37.829492673 -0200
> > +++ dpdk-17.08/app/test-pmd/testpmd.c	2017-10-30 22:45:48.323522591 -0200
> > @@ -327,6 +327,13 @@
> >
> >  #endif
> >
> > +
> > +/* How long to sleep in packet processing */
> > +uint32_t nanosleep_length;
> > +
> > +/* How often to sleep in packet processing */
> > +uint32_t nanosleep_frequency;
> > +
> >  /*
> >   * Ethernet device configuration.
> >   */
> > diff -Nur dpdk-17.08.orig/app/test-pmd/testpmd.h dpdk-17.08/app/test-pmd/testpmd.h
> > --- dpdk-17.08.orig/app/test-pmd/testpmd.h	2017-10-30 22:45:37.829492673 -0200
> > +++ dpdk-17.08/app/test-pmd/testpmd.h	2017-10-30 22:45:48.323522591 -0200
> > @@ -127,6 +127,7 @@
> >  	struct pkt_burst_stats rx_burst_stats;
> >  	struct pkt_burst_stats tx_burst_stats;
> >  #endif
> > +	uint64_t next_nanosleep;
> >  };
> >
> >  /** Offload IP checksum in csum forward engine */
> > @@ -390,6 +391,9 @@
> >  extern lcoreid_t latencystats_lcore_id;
> >  #endif
> >
> > +extern uint32_t nanosleep_length;
> > +extern uint32_t nanosleep_frequency;
> > +
> >  #ifdef RTE_LIBRTE_BITRATE
> >  extern lcoreid_t bitrate_lcore_id;
> >  extern uint8_t bitrate_enabled;
> >
> >
> >
> >
> >
> >
> >
> >
> >
> 
> --
> Adrien Mazarguil
> 6WIND


* Re: [PATCH] testpmd: add nanosleep in main loop
  2017-11-10 10:14   ` Ananyev, Konstantin
@ 2017-11-10 10:42     ` Daniel Bristot de Oliveira
  2017-11-10 11:14       ` Bruce Richardson
  2017-11-11  3:49     ` Marcelo Tosatti
  1 sibling, 1 reply; 14+ messages in thread
From: Daniel Bristot de Oliveira @ 2017-11-10 10:42 UTC (permalink / raw)
  To: Ananyev, Konstantin, Adrien Mazarguil, Marcelo Tosatti
  Cc: dev, Luiz Capitulino



On 11/10/2017 11:14 AM, Ananyev, Konstantin wrote:
> Agree with Adrien here - the patch doesn't fix the problem in any case,

I would agree with you if it were possible to assume one can fully
isolate a CPU on Linux... but it is not...

This:
https://lwn.net/Articles/659490/

is still an open issue, and the reason why it is an open issue is the
kernel threads that need to run on every CPU, mainly when using the
PREEMPT_RT, which turns almost everything into threads.

> while introducing an unnecessary slowdown in testpmd iofwd mode.
> Please think up some other approach.

The other approach is to increase the priority of all other threads that
run on the isolated CPU. But that is not a good idea at all, as the other
threads might preempt the busy-loop thread at the worst possible moment.

Using the knowledge of the thread about when it is the best time to give
a chance for other threads to run would be a smarter decision.
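
(In the patch, that point is the top of each forwarding iteration,
between bursts, when the thread holds no packets in flight.)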

-- Daniel


* Re: [PATCH] testpmd: add nanosleep in main loop
  2017-11-10 10:42     ` Daniel Bristot de Oliveira
@ 2017-11-10 11:14       ` Bruce Richardson
  2017-11-10 13:51         ` Luiz Capitulino
  2017-11-11  3:54         ` Marcelo Tosatti
  0 siblings, 2 replies; 14+ messages in thread
From: Bruce Richardson @ 2017-11-10 11:14 UTC (permalink / raw)
  To: Daniel Bristot de Oliveira
  Cc: Ananyev, Konstantin, Adrien Mazarguil, Marcelo Tosatti, dev,
	Luiz Capitulino

On Fri, Nov 10, 2017 at 11:42:56AM +0100, Daniel Bristot de Oliveira wrote:
> 
> 
> On 11/10/2017 11:14 AM, Ananyev, Konstantin wrote:
> > Agree with Adrien here - the patch doesn't fix the problem in any case,
> 
> I would agree with you if it were possible to assume one can fully
> isolate a CPU on Linux... but it is not...
> 
> This:
> https://lwn.net/Articles/659490/
> 
> is still an open issue, and the reason why it is an open issue is the
> kernel threads that need to run on every CPU, mainly when using the
> PREEMPT_RT, which turns almost everything into threads.
> 
> > while introducing an unnecessary slowdown in testpmd iofwd mode.
> > Please think up some other approach.
> 
> The other approach is to increase the priority of all other threads that
> run on the isolated CPU. But that is not a good idea at all, as the other
> threads might preempt the busy-loop thread at the worst possible moment.
> 
> Using the knowledge of the thread about when it is the best time to give
> a chance for other threads to run would be a smarter decision.
> 
I don't like having this in the main loop either, and to echo others I
wouldn't have thought that testpmd was actually used as anything other
than a testing app. Also, I would have thought that running it at
realtime priority wouldn't be a good idea, because of exactly this
problem.

On the specifics of the solution, would using sched_yield() rather than
nanosleep not be more suitable, given that the reason for this sleep is
just to give the CPU to other threads?

/Bruce


* Re: [PATCH] testpmd: add nanosleep in main loop
  2017-11-10 11:14       ` Bruce Richardson
@ 2017-11-10 13:51         ` Luiz Capitulino
  2017-11-11  3:59           ` Marcelo Tosatti
  2017-11-11  3:54         ` Marcelo Tosatti
  1 sibling, 1 reply; 14+ messages in thread
From: Luiz Capitulino @ 2017-11-10 13:51 UTC (permalink / raw)
  To: Bruce Richardson
  Cc: Daniel Bristot de Oliveira, Ananyev, Konstantin,
	Adrien Mazarguil, Marcelo Tosatti, dev

On Fri, 10 Nov 2017 11:14:51 +0000
Bruce Richardson <bruce.richardson@intel.com> wrote:

> On Fri, Nov 10, 2017 at 11:42:56AM +0100, Daniel Bristot de Oliveira wrote:
> > 
> > 
> > On 11/10/2017 11:14 AM, Ananyev, Konstantin wrote:  
> > > Agree with Adrien here - the patch doesn't fix the problem in any case,
> > 
> > I would agree with you if it were possible to assume one can fully
> > isolate a CPU on Linux... but it is not...
> > 
> > This:
> > https://lwn.net/Articles/659490/
> > 
> > is still an open issue, and the reason why it is an open issue is the
> > kernel threads that need to run on every CPU, mainly when using the
> > PREEMPT_RT, which turns almost everything into threads.
> >   
> > > while introducing an unnecessary slowdown in testpmd iofwd mode.
> > > Please think up some other approach.  
> > 
> > The other approach is to increase the priority of all other threads that
> > run on the isolated CPU. But that is not a good idea at all, as the other
> > threads might preempt the busy-loop thread at the worst possible moment.
> > 
> > Using the knowledge of the thread about when it is the best time to give
> > a chance for other threads to run would be a smarter decision.
> >   
> I don't like having this in the main loop either, and to echo others I
> wouldn't have thought that testpmd was actually used as anything other
> than a testing app. 

That's why we're patching it. We want to be aware of the implications.
If it's not good for testpmd, it may not be good for production either.


* Re: [PATCH] testpmd: add nanosleep in main loop
  2017-11-10  9:12 ` Adrien Mazarguil
  2017-11-10 10:13   ` Daniel Bristot de Oliveira
  2017-11-10 10:14   ` Ananyev, Konstantin
@ 2017-11-11  3:45   ` Marcelo Tosatti
  2 siblings, 0 replies; 14+ messages in thread
From: Marcelo Tosatti @ 2017-11-11  3:45 UTC (permalink / raw)
  To: Adrien Mazarguil; +Cc: dev, Luiz Capitulino, Daniel Bristot de Oliveira

On Fri, Nov 10, 2017 at 10:12:19AM +0100, Adrien Mazarguil wrote:
> Hi Marcelo,

Hello Adrien,

> On Fri, Nov 10, 2017 at 04:02:10AM -0200, Marcelo Tosatti wrote:
> > 
> > This patch allows a configurable pair of values to be set, which
> > control the frequency and length of a nanosleep call performed in
> > test-pmd's iofwd main loop.
> >
> > The problem is the following: it is necessary to execute, on the
> > isolated CPUs, code which is not part of the packet forwarding load.
> >
> > For example:
> >
> >  "echo val > /sys/kernel/debug/tracing/buffer_size_kb"
> >
> > hangs the process, because the DPDK thread has higher priority than
> > the workqueue thread which executes the flush from the CPU-local
> > trace buffer to the CPU-global trace buffer [the workitem in
> > question].
> >
> > There are more serious issues than the trace-cmd bug, such as XFS
> > workitems failing to execute, causing filesystem corruption.
> >
> > To work around this problem, until a proper kernel solution is
> > developed, allow DPDK to nanosleep (hopefully at a low enough
> > frequency and for a short enough interval that performance stays
> > within acceptable levels).
> 
> I understand the need to do something about it, however the nanosleep()
> approach seems questionable to me.
> 
> Testpmd's forwarding modes (particularly I/O) are used for benchmarking
> purposes by many and are therefore sensitive to change. This code path is
> currently free from system calls for that reason and nanosleep() is an
> expensive one by definition. Even if optional or called at a low frequency,
> the presence of this new code has an impact.

Did you see the results? The impact is limited.

Do you understand that without this fix, the system can be unable
to perform certain operations, and can even suffer filesystem corruption?

Those issues are more important than performance.

> Since testpmd is a development tool not supposed to run in a production
> environment, is there really a need for it to be patched to work around a
> (temporary) Linux kernel bug?
> 
> If so, why is I/O the only forwarding mode impacted?

It's just an example that applications can copy from.

It's disabled by default.
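
For reference, the whole pattern an application would copy is a few
lines. A minimal sketch (illustrative names; cycle accounting as in
the patch):

	#include <stdint.h>
	#include <time.h>
	#include <rte_cycles.h>

	static uint64_t next_sleep;	/* TSC deadline for the next sleep */

	/* Call once per poll-loop iteration: sleep for sleep_ns
	 * nanoseconds once every 1/hz seconds of busy polling. */
	static void poll_loop_yield(unsigned int hz, long sleep_ns)
	{
		uint64_t now = rte_get_timer_cycles();

		if (now > next_sleep) {
			struct timespec req = { 0, sleep_ns };

			nanosleep(&req, NULL);
			next_sleep = now + rte_get_timer_hz() / hz;
		}
	}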

> If it's used in a production environment and such a fix can't wait, have
> other workarounds been considered:
> 
> - Replacing testpmd in I/O mode with a physical cable or switch?

Why would that fix the problem?

> - Using proper options on the kernel command line as described in [1], such
>   as isolcpus, rcu_nocbs, nohz_full?

Yes, proper options are being used (this is a fundamental problem).

> 
> [1] doc/guides/howto/pvp_reference_benchmark.rst
> 
> > 
> > The new parameters are:
> > 
> > *  --delay-hz: sets nanosleep frequency in Hz.
> > *  --delay-length: sets nanosleep length in ns.
> > 
> > Results for delay-hz=100,delay-length=10000 (which allows 
> > the buffer_size_kb change to complete):
> > 
> > Baseline run-1:
> > [Histogram port 0 to port 1 at rate 2.3 Mpps] Samples: 49505, Average:
> > 19008.7 ns, StdDev: 2501.0 ns, Quartiles: 17293.0/18330.0/19901.0 ns
> > 
> > Baseline run-2:
> > [Histogram port 0 to port 1 at rate 2.3 Mpps] Samples: 49606, Average:
> > 19036.4 ns, StdDev: 2485.2 ns, Quartiles: 17318.0/18349.0/19936.0 ns
> > 
> > Baseline run-3:
> > [Histogram port 0 to port 1 at rate 2.3 Mpps] Samples: 49627, Average:
> > 19019.2 ns, StdDev: 2503.7 ns, Quartiles: 17323.0/18355.0/19940.0 ns
> > 
> > ============================
> > 
> > (10.000us, 100HZ)
> > 
> > Run-1:
> > [Histogram port 0 to port 1 at rate 2.3 Mpps] Samples: 7284, Average:
> > 20830.6 ns, StdDev: 12023.0 ns, Quartiles: 17309.0/18394.0/20233.0 ns
> > 
> > Run-2:
> > [Histogram port 0 to port 1 at rate 2.3 Mpps] Samples: 6272, Average:
> > 20897.1 ns, StdDev: 12057.2 ns, Quartiles: 17389.0/18457.0/20266.0 ns
> > 
> > Run-3:
> > [Histogram port 0 to port 1 at rate 2.3 Mpps] Samples: 4843, Average:
> > 20535.2 ns, StdDev: 9827.3 ns, Quartiles: 17389.0/18441.0/20269.0 ns
> > 
> > 
> > Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> > 
> > 
> > diff -Nur dpdk-17.08.orig/app/test-pmd/iofwd.c dpdk-17.08/app/test-pmd/iofwd.c
> > --- dpdk-17.08.orig/app/test-pmd/iofwd.c	2017-10-30 22:45:37.829492673 -0200
> > +++ dpdk-17.08/app/test-pmd/iofwd.c	2017-10-30 22:45:48.321522581 -0200
> > @@ -64,9 +64,30 @@
> >  #include <rte_ethdev.h>
> >  #include <rte_string_fns.h>
> >  #include <rte_flow.h>
> > +#include <time.h>
> >  
> >  #include "testpmd.h"
> >  
> > +uint32_t nanosleep_interval;
> > +
> > +static void calc_nanosleep_interval(int hz)
> > +{
> > +	uint64_t cycles_per_sec = rte_get_timer_hz();
> > +	nanosleep_interval = cycles_per_sec/hz;
> > +}
> > +
> > +static void do_nanosleep(void)
> > +{
> > +	struct timespec req;
> > +
> > +	req.tv_sec = 0;
> > +	req.tv_nsec = nanosleep_length;
> > +
> > +	nanosleep(&req, NULL);
> > +
> > +	return;
> > +}
> > +
> >  /*
> >   * Forwarding of packets in I/O mode.
> >   * Forward packets "as-is".
> > @@ -81,6 +102,10 @@
> >  	uint16_t nb_tx;
> >  	uint32_t retry;
> >  
> > +
> > +	if (nanosleep_interval == 0 && nanosleep_frequency > 0)
> > +		calc_nanosleep_interval(nanosleep_frequency);
> > +
> >  #ifdef RTE_TEST_PMD_RECORD_CORE_CYCLES
> >  	uint64_t start_tsc;
> >  	uint64_t end_tsc;
> > @@ -91,6 +116,12 @@
> >  	start_tsc = rte_rdtsc();
> >  #endif
> >  
> > +	if (nanosleep_frequency > 0 &&
> > +	    rte_get_timer_cycles() > fs->next_nanosleep) {
> > +		do_nanosleep();
> > +		fs->next_nanosleep = rte_get_timer_cycles() + nanosleep_interval;
> > +	}
> > +
> >  	/*
> >  	 * Receive a burst of packets and forward them.
> >  	 */
> > diff -Nur dpdk-17.08.orig/app/test-pmd/parameters.c dpdk-17.08/app/test-pmd/parameters.c
> > --- dpdk-17.08.orig/app/test-pmd/parameters.c	2017-10-30 22:45:37.830492676 -0200
> > +++ dpdk-17.08/app/test-pmd/parameters.c	2017-10-30 22:46:33.708651912 -0200
> > @@ -216,6 +216,8 @@
> >  	       "disable print of designated event or all of them.\n");
> >  	printf("  --flow-isolate-all: "
> >  	       "requests flow API isolated mode on all ports at initialization time.\n");
> > +	printf("  --delay-hz: sets nanosleep frequency in Hz.\n"); 
> > +	printf("  --delay-length: sets nanosleep length in ns.\n"); 
> >  }
> >  
> >  #ifdef RTE_LIBRTE_CMDLINE
> > @@ -638,7 +640,9 @@
> >  		{ "no-rmv-interrupt",		0, 0, 0 },
> >  		{ "print-event",		1, 0, 0 },
> >  		{ "mask-event",			1, 0, 0 },
> > -		{ 0, 0, 0, 0 },
> > +		{ "delay-hz",			1, 0, 0 },
> > +		{ "delay-length",		1, 0, 0 },
> > +	 	{ 0, 0, 0, 0 },
> >  	};
> >  
> >  	argvopt = argv;
> > @@ -1099,6 +1103,27 @@
> >  				else
> >  					rte_exit(EXIT_FAILURE, "bad txpkts\n");
> >  			}
> > +
> > +			if (!strcmp(lgopts[opt_idx].name, "delay-hz")) {
> > +				int n;
> > +
> > +				n = atoi(optarg);
> > +
> > +				if (n < 0)
> > +					rte_exit(EXIT_FAILURE, "bad delay-hz\n");
> > +				nanosleep_frequency = n;
> > +			}
> > +
> > +			if (!strcmp(lgopts[opt_idx].name, "delay-length")) {
> > +				int n;
> > +
> > +				n = atoi(optarg);
> > +
> > +				if (n < 0)
> > +					rte_exit(EXIT_FAILURE, "bad delay-length\n");
> > +				nanosleep_length = n;
> > +			}
> > +
> >  			if (!strcmp(lgopts[opt_idx].name, "no-flush-rx"))
> >  				no_flush_rx = 1;
> >  			if (!strcmp(lgopts[opt_idx].name, "disable-link-check"))
> > diff -Nur dpdk-17.08.orig/app/test-pmd/testpmd.c dpdk-17.08/app/test-pmd/testpmd.c
> > --- dpdk-17.08.orig/app/test-pmd/testpmd.c	2017-10-30 22:45:37.829492673 -0200
> > +++ dpdk-17.08/app/test-pmd/testpmd.c	2017-10-30 22:45:48.323522591 -0200
> > @@ -327,6 +327,13 @@
> >  
> >  #endif
> >  
> > +
> > +/* How long to sleep in packet processing */
> > +uint32_t nanosleep_length;
> > +
> > +/* How often to sleep in packet processing */
> > +uint32_t nanosleep_frequency;
> > +
> >  /*
> >   * Ethernet device configuration.
> >   */
> > diff -Nur dpdk-17.08.orig/app/test-pmd/testpmd.h dpdk-17.08/app/test-pmd/testpmd.h
> > --- dpdk-17.08.orig/app/test-pmd/testpmd.h	2017-10-30 22:45:37.829492673 -0200
> > +++ dpdk-17.08/app/test-pmd/testpmd.h	2017-10-30 22:45:48.323522591 -0200
> > @@ -127,6 +127,7 @@
> >  	struct pkt_burst_stats rx_burst_stats;
> >  	struct pkt_burst_stats tx_burst_stats;
> >  #endif
> > +	uint64_t next_nanosleep;
> >  };
> >  
> >  /** Offload IP checksum in csum forward engine */
> > @@ -390,6 +391,9 @@
> >  extern lcoreid_t latencystats_lcore_id;
> >  #endif
> >  
> > +extern uint32_t nanosleep_length;
> > +extern uint32_t nanosleep_frequency;
> > +
> >  #ifdef RTE_LIBRTE_BITRATE
> >  extern lcoreid_t bitrate_lcore_id;
> >  extern uint8_t bitrate_enabled;
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> 
> -- 
> Adrien Mazarguil
> 6WIND


* Re: [PATCH] testpmd: add nanosleep in main loop
  2017-11-10 10:14   ` Ananyev, Konstantin
  2017-11-10 10:42     ` Daniel Bristot de Oliveira
@ 2017-11-11  3:49     ` Marcelo Tosatti
  2017-11-12 23:14       ` Ananyev, Konstantin
  1 sibling, 1 reply; 14+ messages in thread
From: Marcelo Tosatti @ 2017-11-11  3:49 UTC (permalink / raw)
  To: Ananyev, Konstantin
  Cc: Adrien Mazarguil, dev, Luiz Capitulino, Daniel Bristot de Oliveira

On Fri, Nov 10, 2017 at 10:14:23AM +0000, Ananyev, Konstantin wrote:
> 
> 
> > -----Original Message-----
> > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Adrien Mazarguil
> > Sent: Friday, November 10, 2017 9:12 AM
> > To: Marcelo Tosatti <mtosatti@redhat.com>
> > Cc: dev@dpdk.org; Luiz Capitulino <lcapitulino@redhat.com>; Daniel Bristot de Oliveira <bristot@redhat.com>
> > Subject: Re: [dpdk-dev] [PATCH] testpmd: add nanosleep in main loop
> > 
> > Hi Marcelo,
> > 
> > On Fri, Nov 10, 2017 at 04:02:10AM -0200, Marcelo Tosatti wrote:
> > >
> > > This patch allows a configurable pair of values to be set, which
> > > control the frequency and length of a nanosleep call performed in
> > > test-pmd's iofwd main loop.
> > >
> > > The problem is the following: it is necessary to execute, on the
> > > isolated CPUs, code which is not part of the packet forwarding load.
> > >
> > > For example:
> > >
> > >  "echo val > /sys/kernel/debug/tracing/buffer_size_kb"
> > >
> > > hangs the process, because the DPDK thread has higher priority than
> > > the workqueue thread which executes the flush from the CPU-local
> > > trace buffer to the CPU-global trace buffer [the workitem in
> > > question].
> > >
> > > There are more serious issues than the trace-cmd bug, such as XFS
> > > workitems failing to execute, causing filesystem corruption.
> > >
> > > To work around this problem, until a proper kernel solution is
> > > developed, allow DPDK to nanosleep (hopefully at a low enough
> > > frequency and for a short enough interval that performance stays
> > > within acceptable levels).
> > 
> > I understand the need to do something about it, however the nanosleep()
> > approach seems questionable to me.
> > 
> > Testpmd's forwarding modes (particularly I/O) are used for benchmarking
> > purposes by many and are therefore sensitive to change. This code path is
> > currently free from system calls for that reason and nanosleep() is an
> > expensive one by definition. Even if optional or called at a low frequency,
> > the presence of this new code has an impact.
> > 
> > Since testpmd is a development tool not supposed to run in a production
> > environment, is there really a need for it to be patched to work around a
> > (temporary) Linux kernel bug?
> > 
> > If so, why is I/O the only forwarding mode impacted?
> > 
> > If it's used in a production environment and such a fix can't wait, have
> > other workarounds been considered:
> > 
> > - Replacing testpmd in I/O mode with a physical cable or switch?
> > 
> > - Using proper options on the kernel command line as described in [1], such
> >   as isolcpus, rcu_nocbs, nohz_full?
> > 
> > [1] doc/guides/howto/pvp_reference_benchmark.rst
> 
> 
> Agree with Adrian here - the patch doesn't fix the problem in any case,

It does fix the problem, as the testing described in the original message shows.

> while introducing an unnecessary slowdown in testpmd iofwd mode.

It is not unnecessary: it is a mandatory slowdown for any approach
which fixes the problem, whether it's in DPDK or not.

> Please think up some other approach.
> Konstantin

What characteristics are you looking for in "some other approach"? 
That DPDK is not interrupted? Impossible.

See, the thing here is that nanosleep length and frequency are
controllable, which allows an application developer to tune the values
_and_ still meet their performance metrics.
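
(For instance, at delay-hz=100 and delay-length=10000 the forwarding
core nominally gives up 100 * 10000 ns = 1 ms of every second, i.e.
roughly 0.1% of its cycles, plus syscall and wakeup overhead on top
of that.)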

I would rather be interested in a robust system (which allows certain
maintenance tasks to execute on isolated CPUs) with deterministic
performance meeting defined goals rather than "maximum performance"
(which is completely meaningless other than for marketing purposes).


> > > The new parameters are:
> > >
> > > *  --delay-hz: sets nanosleep frequency in Hz.
> > > *  --delay-length: sets nanosleep length in ns.
> > >
> > > Results for delay-hz=100,delay-length=10000 (which allows
> > > the buffer_size_kb change to complete):
> > >
> > > Baseline run-1:
> > > [Histogram port 0 to port 1 at rate 2.3 Mpps] Samples: 49505, Average:
> > > 19008.7 ns, StdDev: 2501.0 ns, Quartiles: 17293.0/18330.0/19901.0 ns
> > >
> > > Baseline run-2:
> > > [Histogram port 0 to port 1 at rate 2.3 Mpps] Samples: 49606, Average:
> > > 19036.4 ns, StdDev: 2485.2 ns, Quartiles: 17318.0/18349.0/19936.0 ns
> > >
> > > Baseline run-3:
> > > [Histogram port 0 to port 1 at rate 2.3 Mpps] Samples: 49627, Average:
> > > 19019.2 ns, StdDev: 2503.7 ns, Quartiles: 17323.0/18355.0/19940.0 ns
> > >
> > > ============================
> > >
> > > (10.000us, 100HZ)
> > >
> > > Run-1:
> > > [Histogram port 0 to port 1 at rate 2.3 Mpps] Samples: 7284, Average:
> > > 20830.6 ns, StdDev: 12023.0 ns, Quartiles: 17309.0/18394.0/20233.0 ns
> > >
> > > Run-2:
> > > [Histogram port 0 to port 1 at rate 2.3 Mpps] Samples: 6272, Average:
> > > 20897.1 ns, StdDev: 12057.2 ns, Quartiles: 17389.0/18457.0/20266.0 ns
> > >
> > > Run-3:
> > > [Histogram port 0 to port 1 at rate 2.3 Mpps] Samples: 4843, Average:
> > > 20535.2 ns, StdDev: 9827.3 ns, Quartiles: 17389.0/18441.0/20269.0 ns
> > >
> > >
> > > Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> > >
> > >
> > > diff -Nur dpdk-17.08.orig/app/test-pmd/iofwd.c dpdk-17.08/app/test-pmd/iofwd.c
> > > --- dpdk-17.08.orig/app/test-pmd/iofwd.c	2017-10-30 22:45:37.829492673 -0200
> > > +++ dpdk-17.08/app/test-pmd/iofwd.c	2017-10-30 22:45:48.321522581 -0200
> > > @@ -64,9 +64,30 @@
> > >  #include <rte_ethdev.h>
> > >  #include <rte_string_fns.h>
> > >  #include <rte_flow.h>
> > > +#include <time.h>
> > >
> > >  #include "testpmd.h"
> > >
> > > +uint32_t nanosleep_interval;
> > > +
> > > +static void calc_nanosleep_interval(int hz)
> > > +{
> > > +	uint64_t cycles_per_sec = rte_get_timer_hz();
> > > +	nanosleep_interval = cycles_per_sec/hz;
> > > +}
> > > +
> > > +static void do_nanosleep(void)
> > > +{
> > > +	struct timespec req;
> > > +
> > > +	req.tv_sec = 0;
> > > +	req.tv_nsec = nanosleep_length;
> > > +
> > > +	nanosleep(&req, NULL);
> > > +
> > > +	return;
> > > +}
> > > +
> > >  /*
> > >   * Forwarding of packets in I/O mode.
> > >   * Forward packets "as-is".
> > > @@ -81,6 +102,10 @@
> > >  	uint16_t nb_tx;
> > >  	uint32_t retry;
> > >
> > > +
> > > +	if (nanosleep_interval == 0 && nanosleep_frequency > 0)
> > > +		calc_nanosleep_interval(nanosleep_frequency);
> > > +
> > >  #ifdef RTE_TEST_PMD_RECORD_CORE_CYCLES
> > >  	uint64_t start_tsc;
> > >  	uint64_t end_tsc;
> > > @@ -91,6 +116,12 @@
> > >  	start_tsc = rte_rdtsc();
> > >  #endif
> > >
> > > +	if (nanosleep_frequency > 0 &&
> > > +	    rte_get_timer_cycles() > fs->next_nanosleep) {
> > > +		do_nanosleep();
> > > +		fs->next_nanosleep = rte_get_timer_cycles() + nanosleep_interval;
> > > +	}
> > > +
> > >  	/*
> > >  	 * Receive a burst of packets and forward them.
> > >  	 */
> > > diff -Nur dpdk-17.08.orig/app/test-pmd/parameters.c dpdk-17.08/app/test-pmd/parameters.c
> > > --- dpdk-17.08.orig/app/test-pmd/parameters.c	2017-10-30 22:45:37.830492676 -0200
> > > +++ dpdk-17.08/app/test-pmd/parameters.c	2017-10-30 22:46:33.708651912 -0200
> > > @@ -216,6 +216,8 @@
> > >  	       "disable print of designated event or all of them.\n");
> > >  	printf("  --flow-isolate-all: "
> > >  	       "requests flow API isolated mode on all ports at initialization time.\n");
> > > +	printf("  --delay-hz: sets nanosleep frequency in Hz.\n");
> > > +	printf("  --delay-length: sets nanosleep length in ns.\n");
> > >  }
> > >
> > >  #ifdef RTE_LIBRTE_CMDLINE
> > > @@ -638,7 +640,9 @@
> > >  		{ "no-rmv-interrupt",		0, 0, 0 },
> > >  		{ "print-event",		1, 0, 0 },
> > >  		{ "mask-event",			1, 0, 0 },
> > > -		{ 0, 0, 0, 0 },
> > > +		{ "delay-hz",			1, 0, 0 },
> > > +		{ "delay-length",		1, 0, 0 },
> > > +	 	{ 0, 0, 0, 0 },
> > >  	};
> > >
> > >  	argvopt = argv;
> > > @@ -1099,6 +1103,27 @@
> > >  				else
> > >  					rte_exit(EXIT_FAILURE, "bad txpkts\n");
> > >  			}
> > > +
> > > +			if (!strcmp(lgopts[opt_idx].name, "delay-hz")) {
> > > +				int n;
> > > +
> > > +				n = atoi(optarg);
> > > +
> > > +				if (n < 0)
> > > +					rte_exit(EXIT_FAILURE, "bad delay-hz\n");
> > > +				nanosleep_frequency = n;
> > > +			}
> > > +
> > > +			if (!strcmp(lgopts[opt_idx].name, "delay-length")) {
> > > +				int n;
> > > +
> > > +				n = atoi(optarg);
> > > +
> > > +				if (n < 0)
> > > +					rte_exit(EXIT_FAILURE, "bad delay-length\n");
> > > +				nanosleep_length = n;
> > > +			}
> > > +
> > >  			if (!strcmp(lgopts[opt_idx].name, "no-flush-rx"))
> > >  				no_flush_rx = 1;
> > >  			if (!strcmp(lgopts[opt_idx].name, "disable-link-check"))
> > > diff -Nur dpdk-17.08.orig/app/test-pmd/testpmd.c dpdk-17.08/app/test-pmd/testpmd.c
> > > --- dpdk-17.08.orig/app/test-pmd/testpmd.c	2017-10-30 22:45:37.829492673 -0200
> > > +++ dpdk-17.08/app/test-pmd/testpmd.c	2017-10-30 22:45:48.323522591 -0200
> > > @@ -327,6 +327,13 @@
> > >
> > >  #endif
> > >
> > > +
> > > +/* How long to sleep in packet processing */
> > > +uint32_t nanosleep_length;
> > > +
> > > +/* How often to sleep in packet processing */
> > > +uint32_t nanosleep_frequency;
> > > +
> > >  /*
> > >   * Ethernet device configuration.
> > >   */
> > > diff -Nur dpdk-17.08.orig/app/test-pmd/testpmd.h dpdk-17.08/app/test-pmd/testpmd.h
> > > --- dpdk-17.08.orig/app/test-pmd/testpmd.h	2017-10-30 22:45:37.829492673 -0200
> > > +++ dpdk-17.08/app/test-pmd/testpmd.h	2017-10-30 22:45:48.323522591 -0200
> > > @@ -127,6 +127,7 @@
> > >  	struct pkt_burst_stats rx_burst_stats;
> > >  	struct pkt_burst_stats tx_burst_stats;
> > >  #endif
> > > +	uint64_t next_nanosleep;
> > >  };
> > >
> > >  /** Offload IP checksum in csum forward engine */
> > > @@ -390,6 +391,9 @@
> > >  extern lcoreid_t latencystats_lcore_id;
> > >  #endif
> > >
> > > +extern uint32_t nanosleep_length;
> > > +extern uint32_t nanosleep_frequency;
> > > +
> > >  #ifdef RTE_LIBRTE_BITRATE
> > >  extern lcoreid_t bitrate_lcore_id;
> > >  extern uint8_t bitrate_enabled;
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > 
> > --
> > Adrien Mazarguil
> > 6WIND


* Re: [PATCH] testpmd: add nanosleep in main loop
  2017-11-10 11:14       ` Bruce Richardson
  2017-11-10 13:51         ` Luiz Capitulino
@ 2017-11-11  3:54         ` Marcelo Tosatti
  1 sibling, 0 replies; 14+ messages in thread
From: Marcelo Tosatti @ 2017-11-11  3:54 UTC (permalink / raw)
  To: Bruce Richardson
  Cc: Daniel Bristot de Oliveira, Ananyev, Konstantin,
	Adrien Mazarguil, dev, Luiz Capitulino

On Fri, Nov 10, 2017 at 11:14:51AM +0000, Bruce Richardson wrote:
> On Fri, Nov 10, 2017 at 11:42:56AM +0100, Daniel Bristot de Oliveira wrote:
> > 
> > 
> > On 11/10/2017 11:14 AM, Ananyev, Konstantin wrote:
> > > Agree with Adrien here - the patch doesn't fix the problem in any case,
> > 
> > I would agree with you if it were possible to assume one can fully
> > isolate a CPU on Linux... but it is not...
> > 
> > This:
> > https://lwn.net/Articles/659490/
> > 
> > is still an open issue, and the reason why it is an open issue is the
> > kernel threads that need to run on every CPU, mainly when using the
> > PREEMPT_RT, which turns almost everything into threads.
> > 
> > > while introducing an unnecessary slowdown in testpmd iofwd mode.
> > > Please think up some other approach.
> > 
> > The other approach is to increase the priority of all other threads that
> > run on the isolated CPU. But that is not a good idea at all, as the other
> > threads might preempt the busy-loop thread at the worst possible moment.
> > 
> > Using the knowledge of the thread about when it is the best time to give
> > a chance for other threads to run would be a smarter decision.
> > 
> I don't like having this in the main loop either, and to echo others I
> wouldn't have thought that testpmd was actually used as anything other
> than a testing app. Also, I would have thought that running it at
> realtime priority wouldn't be a good idea, because of exactly this
> problem.

Bruce,

It's just meant as an example for application developers, in the
official DPDK repository. And it's disabled by default, so it does not
affect the performance numbers.

That said, do you have a problem integrating the patch?

> On the specifics of the solution, would using sched_yield() rather than
> nanosleep not be more suitable, given that the reason for this sleep is
> just to give the CPU to other threads?
> 
> /Bruce

Yes, but unfortunately sched_yield() does not return in a timely
fashion; we would need a new "sched_yield_time(X us)" with a
guaranteed return from the call within a maximum of X us.
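
A hypothetical signature, just to make the idea concrete (no such
call exists in Linux today):

	/* Yield the CPU to any other runnable threads, but return to
	 * the caller within at most max_us microseconds. */
	int sched_yield_time(unsigned int max_us);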

(Note that the values nanosleep_frequency=100Hz, nanosleep_length=10us
do increase the latency values a bit but still maintain acceptable
latency.) The arguments about performance must be weighed against
these results.






* Re: [PATCH] testpmd: add nanosleep in main loop
  2017-11-10 13:51         ` Luiz Capitulino
@ 2017-11-11  3:59           ` Marcelo Tosatti
  2017-11-11  4:01             ` Marcelo Tosatti
  0 siblings, 1 reply; 14+ messages in thread
From: Marcelo Tosatti @ 2017-11-11  3:59 UTC (permalink / raw)
  To: Luiz Capitulino
  Cc: Bruce Richardson, Daniel Bristot de Oliveira, Ananyev,
	Konstantin, Adrien Mazarguil, dev

On Fri, Nov 10, 2017 at 08:51:02AM -0500, Luiz Capitulino wrote:
> On Fri, 10 Nov 2017 11:14:51 +0000
> Bruce Richardson <bruce.richardson@intel.com> wrote:
> 
> > On Fri, Nov 10, 2017 at 11:42:56AM +0100, Daniel Bristot de Oliveira wrote:
> > > 
> > > 
> > > On 11/10/2017 11:14 AM, Ananyev, Konstantin wrote:  
> > > > Agree with Adrien here - the patch doesn't fix the problem in any case,
> > > 
> > > I would agree with you if it were possible to assume one can fully
> > > isolate a CPU on Linux... but it is not...
> > > 
> > > This:
> > > https://lwn.net/Articles/659490/
> > > 
> > > is still an open issue, and the reason why it is an open issue is the
> > > kernel threads that need to run on every CPU, mainly when using the
> > > PREEMPT_RT, which turns almost everything into threads.
> > >   
> > > > while introducing an unnecessary slowdown in testpmd iofwd mode.
> > > > Please think up some other approach.  
> > > 
> > > The other approach is to increase the priority of all other threads that
> > > run on the isolated CPU. But that is not a good idea at all, as the other
> > > threads might preempt the busy-loop thread at the worst possible moment.
> > > 
> > > Using the knowledge of the thread about when it is the best time to give
> > > a chance for other threads to run would be a smarter decision.
> > >   
> > I don't like having this in the main loop either, and to echo others I
> > wouldn't have thought that testpmd was actually used as anything other
> > than a testing app. 
> 
> That's why we're patching it. We want to be aware of the implications.
> If it's not good for testpmd, it may not be good for production either.

The arguments raised so far against the patch have been:

1) Performance is reduced.
Reply: 
* Of course performance is reduced, but any solution will also 
reduce performance similarly.
* Performance is reduced but within the acceptable limits set by
NFV standards. So the performance reduction argument is kind
of not an issue (in my POV).

2) Testpmd is a test application.

Well, if one would like to avoid XFS corruption or other similar
results caused by the impossibility of running poll-mode testpmd
(while testing), then he should enable the options (which are disabled
by default). Moreover, testpmd is an example application used by
production developers, so the workaround should be integrated into
testpmd.


Does anyone have arguments against the reasoning above?


* Re: [PATCH] testpmd: add nanosleep in main loop
  2017-11-11  3:59           ` Marcelo Tosatti
@ 2017-11-11  4:01             ` Marcelo Tosatti
  0 siblings, 0 replies; 14+ messages in thread
From: Marcelo Tosatti @ 2017-11-11  4:01 UTC (permalink / raw)
  To: Luiz Capitulino
  Cc: Bruce Richardson, Daniel Bristot de Oliveira, Ananyev,
	Konstantin, Adrien Mazarguil, dev

On Sat, Nov 11, 2017 at 01:59:21AM -0200, Marcelo Tosatti wrote:
> On Fri, Nov 10, 2017 at 08:51:02AM -0500, Luiz Capitulino wrote:
> > On Fri, 10 Nov 2017 11:14:51 +0000
> > Bruce Richardson <bruce.richardson@intel.com> wrote:
> > 
> > > On Fri, Nov 10, 2017 at 11:42:56AM +0100, Daniel Bristot de Oliveira wrote:
> > > > 
> > > > 
> > > > On 11/10/2017 11:14 AM, Ananyev, Konstantin wrote:  
> > > > > Agree with Adrien here - the patch doesn't fix the problem in any case,  
> > > > 
> > > > I would agree with you if it were possible to assume one can fully
> > > > isolate a CPU on Linux... but it is not...
> > > > 
> > > > This:
> > > > https://lwn.net/Articles/659490/
> > > > 
> > > > is still an open issue, and the reason why it is an open issue is the
> > > > kernel threads that need to run on every CPU, mainly when using the
> > > > PREEMPT_RT, which turns almost everything into threads.
> > > >   
> > > > > while introducing an unnecessary slowdown in testpmd iofwd mode.
> > > > > Please think up some other approach.  
> > > > 
> > > > The other approach is to increase the priority of all other threads that
> > > > run on the isolated CPU. But that is not a good idea at all, as the other
> > > > threads might preempt the busy-loop thread at the worst possible moment.
> > > > 
> > > > Using the knowledge of the thread about when it is the best time to give
> > > > a chance for other threads to run would be a smarter decision.
> > > >   
> > > I don't like having this in the main loop either, and to echo others I
> > > wouldn't have thought that testpmd was actually used as anything other
> > > than a testing app. 
> > 
> > That's why we're patching it. We want to be aware of the implications.
> > If it's not good for testpmd, it may not be good for production either.
> 
> The arguments raised so far against the patch have been:
> 
> 1) Performance is reduced.
> Reply: 
> * Of course performance is reduced, but any solution will also
> reduce performance similarly.
> * Performance is reduced, but it stays within the acceptable limits set
> by NFV standards, so the performance-reduction argument is not really
> an issue in my view.
> 
> 2) Testpmd is a test application.
> 
> Well, anyone who wants to avoid XFS corruption, or similar failures
> caused by the impossibility of running maintenance tasks alongside
> poll-mode testpmd (while testing), should enable the options (which are
> disabled by default). Moreover, testpmd is an example application that
> production developers copy from, so the workaround should be integrated
> into testpmd.
> 
> Does anyone have arguments against the reasoning above?

Note: yes, the kernel seems to be the proper place to fix this;
however:

1) It'll take some time to fix the kernel to handle the problem.

2) In the meantime, a temporary workaround in DPDK is available.
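For reference, enabling the workaround would look like this (a sketch:
the EAL core/memory arguments and the forward-mode flag are
illustrative; only the two --delay-* options come from the patch):

	./testpmd -l 0-2 -n 4 -- -i --forward-mode=io \
		--delay-hz=100 --delay-length=10000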

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] testpmd: add nanosleep in main loop
  2017-11-11  3:49     ` Marcelo Tosatti
@ 2017-11-12 23:14       ` Ananyev, Konstantin
  2017-11-13 18:01         ` Marcelo Tosatti
  0 siblings, 1 reply; 14+ messages in thread
From: Ananyev, Konstantin @ 2017-11-12 23:14 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Adrien Mazarguil, dev, Luiz Capitulino, Daniel Bristot de Oliveira



> -----Original Message-----
> From: Marcelo Tosatti [mailto:mtosatti@redhat.com]
> Sent: Saturday, November 11, 2017 3:50 AM
> To: Ananyev, Konstantin <konstantin.ananyev@intel.com>
> Cc: Adrien Mazarguil <adrien.mazarguil@6wind.com>; dev@dpdk.org; Luiz Capitulino <lcapitulino@redhat.com>; Daniel Bristot de Oliveira
> <bristot@redhat.com>
> Subject: Re: [dpdk-dev] [PATCH] testpmd: add nanosleep in main loop
> 
> On Fri, Nov 10, 2017 at 10:14:23AM +0000, Ananyev, Konstantin wrote:
> >
> >
> > > -----Original Message-----
> > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Adrien Mazarguil
> > > Sent: Friday, November 10, 2017 9:12 AM
> > > To: Marcelo Tosatti <mtosatti@redhat.com>
> > > Cc: dev@dpdk.org; Luiz Capitulino <lcapitulino@redhat.com>; Daniel Bristot de Oliveira <bristot@redhat.com>
> > > Subject: Re: [dpdk-dev] [PATCH] testpmd: add nanosleep in main loop
> > >
> > > Hi Marcelo,
> > >
> > > On Fri, Nov 10, 2017 at 04:02:10AM -0200, Marcelo Tosatti wrote:
> > > >
> > > > This patch allows a configurable pair of values to be set, which
> > > > controls
> > > > the frequency and length of a nanosleep call performed at test-pmd's
> > > > iofwd main loop.
> > > >
> > > > The problem is the following: it is necessary to execute code
> > > > on isolated CPUs which is not part of the packet forwarding load.
> > > >
> > > > For example:
> > > >
> > > >  "echo val > /sys/kernel/debug/tracing/buffer_size_kb"
> > > >
> > > > hangs the process, because the DPDK thread has higher
> > > > priority than the workqueue thread which executes the flush from
> > > > CPU local tracebuffer to CPU global trace buffer [the workitem
> > > > in case].
> > > >
> > > > There are more serious issues than the trace-cmd bug, such as XFS
> > > > workitems failing to execute causing filesystem corruption.
> > > >
> > > > To workaround this problem, until a proper kernel
> > > > solution is developed, allow DPDK to nanosleep
> > > > (hopefully with a small enough frequency and interval
> > > > so that the performance is within acceptable levels).
> > >
> > > I understand the need to do something about it, however the nanosleep()
> > > approach seems questionable to me.
> > >
> > > Testpmd's forwarding modes (particularly I/O) are used for benchmarking
> > > purposes by many and are therefore sensitive to change. This code path is
> > > currently free from system calls for that reason and nanosleep() is an
> > > expensive one by definition. Even if optional or called at a low frequency,
> > > the presence of this new code has an impact.
> > >
> > > Since testpmd is a development tool not supposed to run in a production
> > > environment, is there really a need for it to be patched to work around a
> > > (temporary) Linux kernel bug?
> > >
> > > If so, why is I/O the only forwarding mode impacted?
> > >
> > > If it's used in a production environment and such a fix can't wait, have
> > > other workarounds been considered:
> > >
> > > - Replacing testpmd in I/O mode with a physical cable or switch?
> > >
> > > - Using proper options on the kernel command line as described in [1], such
> > >   as isolcpus, rcu_nocbs, nohz_full?
> > >
> > > [1] doc/guides/howto/pvp_reference_benchmark.rst
> >
> >
> > Agree with Adrien here - the patch doesn't fix the problem in any case,
> 
> It does fix the problem, as the testing described in the original
> message shows.

If the user runs testpmd with a different fwd mode
(macfwd, csum, txonly, etc.), he would hit exactly the same problem.
If the user runs any other DPDK sample application (l2fwd, l3fwd, etc.),
he would hit the same problem again.
If some DPDK customers have a busy loop inside their production systems,
they would hit that problem too.
As I understand it, the problem is not even DPDK-specific: any
application that runs a busy loop can be affected.
Correct?
So I think the patch doesn't fix the problem; all it does is help avoid
a particular manifestation of it.

BTW, if it is a generic kernel problem, I suppose there should be some
record in the kernel bugzilla to track it, right?
If so, could you provide a reference to it?

From the other side, testpmd is not a production app;
as the name implies, it is a tool to test PMD functionality and
performance. In particular, iofwd is a sort of synthetic benchmark that
allows measuring the highest possible PMD performance.
That's why I think many people (me included) would prefer to keep it
intact and free from any system calls.
   
If you are that desperate to provide some workaround specifically for
testpmd, my suggestion would be to introduce a new fwd mode that calls
nanosleep() periodically, while keeping the original iofwd mode intact
(see the sketch below).
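Something along these lines - a rough sketch only, reusing
do_nanosleep() and the pacing globals from the posted patch; the engine
name, the fall-through call and its registration in fwd_engines[] are
illustrative (pkt_burst_io_forward() is static in the real tree, so it
would have to be exported or duplicated):

static void
pkt_burst_io_sleep(struct fwd_stream *fs)
{
	/* TSC-paced sleep, exactly as in the posted patch. */
	if (nanosleep_frequency > 0 &&
	    rte_get_timer_cycles() > fs->next_nanosleep) {
		do_nanosleep();
		fs->next_nanosleep = rte_get_timer_cycles() +
				     nanosleep_interval;
	}
	pkt_burst_io_forward(fs);	/* plain I/O forwarding, untouched */
}

struct fwd_engine io_sleep_fwd_engine = {
	.fwd_mode_name	= "io-sleep",	/* e.g. "set fwd io-sleep" */
	.port_fwd_begin	= NULL,
	.port_fwd_end	= NULL,
	.packet_fwd	= pkt_burst_io_sleep,
};

That way benchmark users lose nothing, and users who need the kernel
threads to run can opt in explicitly.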

> 
> > while introducing an unnecessary slowdown in testpmd iofwd mode.
> 
> It is not unnecessary: it is a mandatory slowdown for any approach
> which fixes the problem, whether it's in DPDK or not.

DPDK runs on other OSes too (FreeBSD).
For non-Linux users it definitely looks like an unnecessary one.

> 
> > Please think up some other approach.
> > Konstantin
> 
> What characteristics are you looking for in "some other approach"?
> That DPDK is not interrupted? Impossible.

Even if it has to be interrupted - why can't this be done transparently
to the user? Via some high-priority kernel thread/interrupt or so?

> 
> See, the thing here is that nanosleep length and frequency are
> controllable, which allows an application developer to tune the values
> _and_ still meet their performance metrics.

Honestly, I can't see how you can force each and every application developer
to start injecting nanosleep() into every busy loop inside their applications.
Not to mention already existing legacy apps.

Konstantin

> 
> I would rather be interested in a robust system (which allows certain
> maintenance tasks to execute on isolated CPUs) with deterministic
> performance meeting defined goals rather than "maximum performance"
> (which is completely meaningless other than for marketing purposes).


> 
> 
> > > > The new parameters are:
> > > >
> > > > *  --delay-hz: sets nanosleep frequency in Hz.
> > > > *  --delay-length: sets nanosleep length in ns.
> > > >
> > > > Results for delay-hz=100,delay-length=10000 (which allows
> > > > the buffer_size_kb change to complete):
> > > >
> > > > Baseline run-1:
> > > > [Histogram port 0 to port 1 at rate 2.3 Mpps] Samples: 49505, Average:
> > > > 19008.7 ns, StdDev: 2501.0 ns, Quartiles: 17293.0/18330.0/19901.0 ns
> > > >
> > > > Baseline run-2:
> > > > [Histogram port 0 to port 1 at rate 2.3 Mpps] Samples: 49606, Average:
> > > > 19036.4 ns, StdDev: 2485.2 ns, Quartiles: 17318.0/18349.0/19936.0 ns
> > > >
> > > > Baseline run-3:
> > > > [Histogram port 0 to port 1 at rate 2.3 Mpps] Samples: 49627, Average:
> > > > 19019.2 ns, StdDev: 2503.7 ns, Quartiles: 17323.0/18355.0/19940.0 ns
> > > >
> > > > ============================
> > > >
> > > > (10.000us, 100HZ)
> > > >
> > > > Run-1:
> > > > [Histogram port 0 to port 1 at rate 2.3 Mpps] Samples: 7284, Average:
> > > > 20830.6 ns, StdDev: 12023.0 ns, Quartiles: 17309.0/18394.0/20233.0 ns
> > > >
> > > > Run-2:
> > > > [Histogram port 0 to port 1 at rate 2.3 Mpps] Samples: 6272, Average:
> > > > 20897.1 ns, StdDev: 12057.2 ns, Quartiles: 17389.0/18457.0/20266.0 ns
> > > >
> > > > Run-3:
> > > > [Histogram port 0 to port 1 at rate 2.3 Mpps] Samples: 4843, Average:
> > > > 20535.2 ns, StdDev: 9827.3 ns, Quartiles: 17389.0/18441.0/20269.0 ns
> > > >
> > > >
> > > > Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> > > >
> > > >
> > > > diff -Nur dpdk-17.08.orig/app/test-pmd/iofwd.c dpdk-17.08/app/test-pmd/iofwd.c
> > > > --- dpdk-17.08.orig/app/test-pmd/iofwd.c	2017-10-30 22:45:37.829492673 -0200
> > > > +++ dpdk-17.08/app/test-pmd/iofwd.c	2017-10-30 22:45:48.321522581 -0200
> > > > @@ -64,9 +64,30 @@
> > > >  #include <rte_ethdev.h>
> > > >  #include <rte_string_fns.h>
> > > >  #include <rte_flow.h>
> > > > +#include <time.h>
> > > >
> > > >  #include "testpmd.h"
> > > >
> > > > +uint32_t nanosleep_interval;
> > > > +
> > > > +static void calc_nanosleep_interval(int hz)
> > > > +{
> > > > +	uint64_t cycles_per_sec = rte_get_timer_hz();
> > > > +	nanosleep_interval = cycles_per_sec/hz;
> > > > +}
> > > > +
> > > > +static void do_nanosleep(void)
> > > > +{
> > > > +	struct timespec req;
> > > > +
> > > > +	req.tv_sec = 0;
> > > > +	req.tv_nsec = nanosleep_length;
> > > > +
> > > > +	nanosleep(&req, NULL);
> > > > +
> > > > +	return;
> > > > +}
> > > > +
> > > >  /*
> > > >   * Forwarding of packets in I/O mode.
> > > >   * Forward packets "as-is".
> > > > @@ -81,6 +102,10 @@
> > > >  	uint16_t nb_tx;
> > > >  	uint32_t retry;
> > > >
> > > > +
> > > > +	if (nanosleep_interval == 0 && nanosleep_frequency > 0)
> > > > +		calc_nanosleep_interval(nanosleep_frequency);
> > > > +
> > > >  #ifdef RTE_TEST_PMD_RECORD_CORE_CYCLES
> > > >  	uint64_t start_tsc;
> > > >  	uint64_t end_tsc;
> > > > @@ -91,6 +116,12 @@
> > > >  	start_tsc = rte_rdtsc();
> > > >  #endif
> > > >
> > > > +	if (nanosleep_frequency > 0 &&
> > > > +	    rte_get_timer_cycles() > fs->next_nanosleep) {
> > > > +		do_nanosleep();
> > > > +		fs->next_nanosleep = rte_get_timer_cycles() + nanosleep_interval;
> > > > +	}
> > > > +
> > > >  	/*
> > > >  	 * Receive a burst of packets and forward them.
> > > >  	 */
> > > > diff -Nur dpdk-17.08.orig/app/test-pmd/parameters.c dpdk-17.08/app/test-pmd/parameters.c
> > > > --- dpdk-17.08.orig/app/test-pmd/parameters.c	2017-10-30 22:45:37.830492676 -0200
> > > > +++ dpdk-17.08/app/test-pmd/parameters.c	2017-10-30 22:46:33.708651912 -0200
> > > > @@ -216,6 +216,8 @@
> > > >  	       "disable print of designated event or all of them.\n");
> > > >  	printf("  --flow-isolate-all: "
> > > >  	       "requests flow API isolated mode on all ports at initialization time.\n");
> > > > +	printf("  --delay-hz: sets nanosleep frequency in Hz.\n");
> > > > +	printf("  --delay-length: sets nanosleep length in ns.\n");
> > > >  }
> > > >
> > > >  #ifdef RTE_LIBRTE_CMDLINE
> > > > @@ -638,7 +640,9 @@
> > > >  		{ "no-rmv-interrupt",		0, 0, 0 },
> > > >  		{ "print-event",		1, 0, 0 },
> > > >  		{ "mask-event",			1, 0, 0 },
> > > > -		{ 0, 0, 0, 0 },
> > > > +		{ "delay-hz",			1, 0, 0 },
> > > > +		{ "delay-length",		1, 0, 0 },
> > > > +	 	{ 0, 0, 0, 0 },
> > > >  	};
> > > >
> > > >  	argvopt = argv;
> > > > @@ -1099,6 +1103,27 @@
> > > >  				else
> > > >  					rte_exit(EXIT_FAILURE, "bad txpkts\n");
> > > >  			}
> > > > +
> > > > +			if (!strcmp(lgopts[opt_idx].name, "delay-hz")) {
> > > > +				int n;
> > > > +
> > > > +				n = atoi(optarg);
> > > > +
> > > > +				if (n < 0)
> > > > +					rte_exit(EXIT_FAILURE, "bad delay-hz\n");
> > > > +				nanosleep_frequency = n;
> > > > +			}
> > > > +
> > > > +			if (!strcmp(lgopts[opt_idx].name, "delay-length")) {
> > > > +				int n;
> > > > +
> > > > +				n = atoi(optarg);
> > > > +
> > > > +				if (n < 0)
> > > > +					rte_exit(EXIT_FAILURE, "bad delay-length\n");
> > > > +				nanosleep_length = n;
> > > > +			}
> > > > +
> > > >  			if (!strcmp(lgopts[opt_idx].name, "no-flush-rx"))
> > > >  				no_flush_rx = 1;
> > > >  			if (!strcmp(lgopts[opt_idx].name, "disable-link-check"))
> > > > diff -Nur dpdk-17.08.orig/app/test-pmd/testpmd.c dpdk-17.08/app/test-pmd/testpmd.c
> > > > --- dpdk-17.08.orig/app/test-pmd/testpmd.c	2017-10-30 22:45:37.829492673 -0200
> > > > +++ dpdk-17.08/app/test-pmd/testpmd.c	2017-10-30 22:45:48.323522591 -0200
> > > > @@ -327,6 +327,13 @@
> > > >
> > > >  #endif
> > > >
> > > > +
> > > > +/* How long to sleep in packet processing */
> > > > +uint32_t nanosleep_length;
> > > > +
> > > > +/* How often to sleep in packet processing */
> > > > +uint32_t nanosleep_frequency;
> > > > +
> > > >  /*
> > > >   * Ethernet device configuration.
> > > >   */
> > > > diff -Nur dpdk-17.08.orig/app/test-pmd/testpmd.h dpdk-17.08/app/test-pmd/testpmd.h
> > > > --- dpdk-17.08.orig/app/test-pmd/testpmd.h	2017-10-30 22:45:37.829492673 -0200
> > > > +++ dpdk-17.08/app/test-pmd/testpmd.h	2017-10-30 22:45:48.323522591 -0200
> > > > @@ -127,6 +127,7 @@
> > > >  	struct pkt_burst_stats rx_burst_stats;
> > > >  	struct pkt_burst_stats tx_burst_stats;
> > > >  #endif
> > > > +	uint64_t next_nanosleep;
> > > >  };
> > > >
> > > >  /** Offload IP checksum in csum forward engine */
> > > > @@ -390,6 +391,9 @@
> > > >  extern lcoreid_t latencystats_lcore_id;
> > > >  #endif
> > > >
> > > > +extern uint32_t nanosleep_length;
> > > > +extern uint32_t nanosleep_frequency;
> > > > +
> > > >  #ifdef RTE_LIBRTE_BITRATE
> > > >  extern lcoreid_t bitrate_lcore_id;
> > > >  extern uint8_t bitrate_enabled;
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > >
> > > --
> > > Adrien Mazarguil
> > > 6WIND

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] testpmd: add nanosleep in main loop
  2017-11-12 23:14       ` Ananyev, Konstantin
@ 2017-11-13 18:01         ` Marcelo Tosatti
  0 siblings, 0 replies; 14+ messages in thread
From: Marcelo Tosatti @ 2017-11-13 18:01 UTC (permalink / raw)
  To: Ananyev, Konstantin
  Cc: Adrien Mazarguil, dev, Luiz Capitulino, Daniel Bristot de Oliveira

On Sun, Nov 12, 2017 at 11:14:35PM +0000, Ananyev, Konstantin wrote:
> 
> 
> > -----Original Message-----
> > From: Marcelo Tosatti [mailto:mtosatti@redhat.com]
> > Sent: Saturday, November 11, 2017 3:50 AM
> > To: Ananyev, Konstantin <konstantin.ananyev@intel.com>
> > Cc: Adrien Mazarguil <adrien.mazarguil@6wind.com>; dev@dpdk.org; Luiz Capitulino <lcapitulino@redhat.com>; Daniel Bristot de Oliveira
> > <bristot@redhat.com>
> > Subject: Re: [dpdk-dev] [PATCH] testpmd: add nanosleep in main loop
> > 
> > On Fri, Nov 10, 2017 at 10:14:23AM +0000, Ananyev, Konstantin wrote:
> > >
> > >
> > > > -----Original Message-----
> > > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Adrien Mazarguil
> > > > Sent: Friday, November 10, 2017 9:12 AM
> > > > To: Marcelo Tosatti <mtosatti@redhat.com>
> > > > Cc: dev@dpdk.org; Luiz Capitulino <lcapitulino@redhat.com>; Daniel Bristot de Oliveira <bristot@redhat.com>
> > > > Subject: Re: [dpdk-dev] [PATCH] testpmd: add nanosleep in main loop
> > > >
> > > > Hi Marcelo,
> > > >
> > > > On Fri, Nov 10, 2017 at 04:02:10AM -0200, Marcelo Tosatti wrote:
> > > > >
> > > > > This patch allows a configurable pair of values to be set, which
> > > > > controls
> > > > > the frequency and length of a nanosleep call performed at test-pmd's
> > > > > iofwd main loop.
> > > > >
> > > > > The problem is the following: it is necessary to execute code
> > > > > on isolated CPUs which is not part of the packet forwarding load.
> > > > >
> > > > > For example:
> > > > >
> > > > >  "echo val > /sys/kernel/debug/tracing/buffer_size_kb"
> > > > >
> > > > > hangs the process, because the DPDK thread has higher
> > > > > priority than the workqueue thread which executes the flush from
> > > > > CPU local tracebuffer to CPU global trace buffer [the workitem
> > > > > in case].
> > > > >
> > > > > There are more serious issues than the trace-cmd bug, such as XFS
> > > > > workitems failing to execute causing filesystem corruption.
> > > > >
> > > > > To workaround this problem, until a proper kernel
> > > > > solution is developed, allow DPDK to nanosleep
> > > > > (hopefully with a small enough frequency and interval
> > > > > so that the performance is within acceptable levels).
> > > >
> > > > I understand the need to do something about it, however the nanosleep()
> > > > approach seems questionable to me.
> > > >
> > > > Testpmd's forwarding modes (particularly I/O) are used for benchmarking
> > > > purposes by many and are therefore sensitive to change. This code path is
> > > > currently free from system calls for that reason and nanosleep() is an
> > > > expensive one by definition. Even if optional or called at a low frequency,
> > > > the presence of this new code has an impact.
> > > >
> > > > Since testpmd is a development tool not supposed to run in a production
> > > > environment, is there really a need for it to be patched to work around a
> > > > (temporary) Linux kernel bug?
> > > >
> > > > If so, why is I/O the only forwarding mode impacted?
> > > >
> > > > If it's used in a production environment and such a fix can't wait, have
> > > > other workarounds been considered:
> > > >
> > > > - Replacing testpmd in I/O mode with a physical cable or switch?
> > > >
> > > > - Using proper options on the kernel command line as described in [1], such
> > > >   as isolcpus, rcu_nocbs, nohz_full?
> > > >
> > > > [1] doc/guides/howto/pvp_reference_benchmark.rst
> > >
> > >
> > > Agree with Adrien here - the patch doesn't fix the problem in any case,
> > 
> > It does fix the problem, as the testing described in the original
> > message shows.
> 
> If the user runs testpmd with a different fwd mode
> (macfwd, csum, txonly, etc.), he would hit exactly the same problem.
> If the user runs any other DPDK sample application (l2fwd, l3fwd, etc.),
> he would hit the same problem again.
> If some DPDK customers have a busy loop inside their production systems,
> they would hit that problem too.

Yes. Are you suggesting any improvement to the patch?

> As I understand it, the problem is not even DPDK-specific: any
> application that runs a busy loop can be affected.
> Correct?

Yes.

> So I think the patch doesn't fix the problem; all it does is help
> avoid a particular manifestation of it.

To fix the problem in l3fwd, the same logic must be applied to l3fwd.
To fix the problem in a DPDK customer's busy loop, that application
must be changed as well.

As long as the same logic is applied to the particular application,
that application is _fixed_ (a generic sketch of the pattern follows).
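For the record, the pattern is small and generic - a sketch, assuming a
DPDK application (identifiers here are illustrative; any TSC-based
clock would do in place of rte_get_timer_cycles()):

#include <time.h>
#include <rte_cycles.h>

static uint64_t next_sleep_tsc;
static uint64_t sleep_interval_tsc;	/* TSC cycles between sleeps */
static long sleep_length_ns;		/* duration of each sleep */

static void
busy_loop(void)
{
	/* e.g. 100 Hz: sleep_interval_tsc = rte_get_timer_hz() / 100; */
	for (;;) {
		if (sleep_interval_tsc != 0 &&
		    rte_get_timer_cycles() > next_sleep_tsc) {
			struct timespec req = { 0, sleep_length_ns };

			/* Yield the isolated CPU to kworkers etc. */
			nanosleep(&req, NULL);
			next_sleep_tsc = rte_get_timer_cycles() +
					 sleep_interval_tsc;
		}
		/* ... one iteration of the real work (rx/tx burst) ... */
	}
}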

> BTW, if it is a generic kernel problem, I suppose there should be some
> record in the kernel bugzilla to track it, right?
> If so, could you provide a reference to it?

It is debatable.

> From the other side, testpmd is not a production app;
> as the name implies, it is a tool to test PMD functionality and
> performance. In particular, iofwd is a sort of synthetic benchmark that
> allows measuring the highest possible PMD performance.
> That's why I think many people (me included) would prefer to keep it
> intact and free from any system calls.

It is free from system calls: the system calls are only made if the new
options are enabled, and they are disabled by default.

> If you are that desperate to provide some workaround specifically for
> testpmd, my suggestion would be to introduce a new fwd mode that calls
> nanosleep() periodically, while keeping the original iofwd mode intact.

This is what the patch does already. Original iofwd mode is intact.

> > > while introducing an unnecessary slowdown in testpmd iofwd mode.
> > 
> > It is not unnecessary: it is a mandatory slowdown for any approach
> > which fixes the problem, whether it's in DPDK or not.
> 
> DPDK runs on other OSes too (FreeBSD).
> For non-Linux users it definitely looks like an unnecessary one.

I bet FreeBSD also has threads, and if it is configured to work in
fully isolated mode, it would hit the same kind of problem.

> 
> > 
> > > Please think up some other approach.
> > > Konstantin
> > 
> > What characteristics are you looking for in "some other approach"?
> > That DPDK is not interrupted? Impossible.
> 
> Even if it has to be interrupted - why can't this be done
> transparently to the user? Via some high-priority kernel
> thread/interrupt or so?

It can be. 


> > See, the thing here is that nanosleep length and frequency are
> > controllable, which allows an application developer to tune the values
> > _and_ still meet their performance metrics.
> 
> Honestly, I can't see how you can force each and every application developer
> to start injecting nanosleep() into every busy loop inside their applications.
> Not to mention already existing legacy apps.
> 
> Konstantin

OK, the example has been posted; it might be useful for someone.

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2017-11-13 22:55 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-11-10  6:02 [PATCH] testpmd: add nanosleep in main loop Marcelo Tosatti
2017-11-10  9:12 ` Adrien Mazarguil
2017-11-10 10:13   ` Daniel Bristot de Oliveira
2017-11-10 10:14   ` Ananyev, Konstantin
2017-11-10 10:42     ` Daniel Bristot de Oliveira
2017-11-10 11:14       ` Bruce Richardson
2017-11-10 13:51         ` Luiz Capitulino
2017-11-11  3:59           ` Marcelo Tosatti
2017-11-11  4:01             ` Marcelo Tosatti
2017-11-11  3:54         ` Marcelo Tosatti
2017-11-11  3:49     ` Marcelo Tosatti
2017-11-12 23:14       ` Ananyev, Konstantin
2017-11-13 18:01         ` Marcelo Tosatti
2017-11-11  3:45   ` Marcelo Tosatti
