All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Ananyev, Konstantin" <konstantin.ananyev@intel.com>
To: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Adrien Mazarguil <adrien.mazarguil@6wind.com>,
	"dev@dpdk.org" <dev@dpdk.org>,
	Luiz Capitulino <lcapitulino@redhat.com>,
	"Daniel Bristot de Oliveira" <bristot@redhat.com>
Subject: Re: [PATCH] testpmd: add nanosleep in main loop
Date: Sun, 12 Nov 2017 23:14:35 +0000	[thread overview]
Message-ID: <2601191342CEEE43887BDE71AB9772585FABC8DB@irsmsx105.ger.corp.intel.com> (raw)
In-Reply-To: <20171111034959.GB23577@amt.cnet>



> -----Original Message-----
> From: Marcelo Tosatti [mailto:mtosatti@redhat.com]
> Sent: Saturday, November 11, 2017 3:50 AM
> To: Ananyev, Konstantin <konstantin.ananyev@intel.com>
> Cc: Adrien Mazarguil <adrien.mazarguil@6wind.com>; dev@dpdk.org; Luiz Capitulino <lcapitulino@redhat.com>; Daniel Bristot de Oliveira
> <bristot@redhat.com>
> Subject: Re: [dpdk-dev] [PATCH] testpmd: add nanosleep in main loop
> 
> On Fri, Nov 10, 2017 at 10:14:23AM +0000, Ananyev, Konstantin wrote:
> >
> >
> > > -----Original Message-----
> > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Adrien Mazarguil
> > > Sent: Friday, November 10, 2017 9:12 AM
> > > To: Marcelo Tosatti <mtosatti@redhat.com>
> > > Cc: dev@dpdk.org; Luiz Capitulino <lcapitulino@redhat.com>; Daniel Bristot de Oliveira <bristot@redhat.com>
> > > Subject: Re: [dpdk-dev] [PATCH] testpmd: add nanosleep in main loop
> > >
> > > Hi Marcelo,
> > >
> > > On Fri, Nov 10, 2017 at 04:02:10AM -0200, Marcelo Tosatti wrote:
> > > >
> > > > This patch allows a configurable pair of values to be set, which
> > > > controls
> > > > the frequency and length of a nanosleep call performed at test-pmd's
> > > > iofwd main loop.
> > > >
> > > > The problem is the following: it is necessary to execute code
> > > > on isolated CPUs which is not part of the packet forwarding load.
> > > >
> > > > For example:
> > > >
> > > >  "echo val > /sys/kernel/debug/tracing/buffer_size_kb"
> > > >
> > > > hangs the process, because the DPDK thread has higher
> > > > priority than the workqueue thread which executes the flush from
> > > > CPU local tracebuffer to CPU global trace buffer [the workitem
> > > > in case].
> > > >
> > > > There are more serious issues than the trace-cmd bug, such as XFS
> > > > workitems failing to execute causing filesystem corruption.
> > > >
> > > > To workaround this problem, until a proper kernel
> > > > solution is developed, allow DPDK to nanosleep
> > > > (hopefully with a small enough frequency and interval
> > > > so that the performance is within acceptable levels).
> > >
> > > I understand the need to do something about it, however the nanosleep()
> > > approach seems questionable to me.
> > >
> > > Testpmd's forwarding modes (particularly I/O) are used for benchmarking
> > > purposes by many and are therefore sensitive to change. This code path is
> > > currently free from system calls for that reason and nanosleep() is an
> > > expensive one by definition. Even if optional or called at a low frequency,
> > > the presence of this new code has an impact.
> > >
> > > Since testpmd is a development tool not supposed to run in a production
> > > environment, is there really a need for it to be patched to work around a
> > > (temporary) Linux kernel bug?
> > >
> > > If so, why is I/O the only forwarding mode impacted?
> > >
> > > If it's used in a production environment and such a fix can't wait, have
> > > other workarounds been considered:
> > >
> > > - Replacing testpmd in I/O mode with a physical cable or switch?
> > >
> > > - Using proper options on the kernel command line as described in [1], such
> > >   as isolcpus, rcu_nocbs, nohz_full?
> > >
> > > [1] doc/guides/howto/pvp_reference_benchmark.rst
> >
> >
> > Agree with Adrian here - the patch doesn't fix the problem in any case,
> 
> It does fix the problem as the original message describes the testing.

If the user will run testpmd with different fwd mode
(macfwd, csum, txonly, etc.) - he would hit exactly the same problem.
If the user would run any other of DPDK sample applications (l2fwd, l3fwd, etc.) -
he would hit the same problem again.
If some of DPDK customers have a busy loop inside their production system -
they would hit that problem too.
As I understand - that problem is even not DPDK related - any application that uses
busy loop inside can be affected.
Correct?
So I think the patch doesn't fix the problem, all it does - helps to avoid
particular manifestation of it.

BTW, if it is a generic kernel problem - I suppose there should be some 
record in kernel bugzilla to track it, right?
If so, could you probably provide some reference to it? 

>From other side - testpmd is not a production app - 
as the name implies it is a tool to test PMDs functionality and performance.
Specially iofwd  is sort of synthetic benchmark that allows to measure
highest possible PMD performance.
That's why I think many people (and me too) would prefer to keep it intact
and free from any system calls.
   
If you are that desperate to provide some workaround sepcially for testpmd -
my suggestion would be to introduce new fwd mode here that would call
nanosleep() periodically, while keeping original iofwd mode intact. 

> 
> > while introducing an unnecessary slowdown in testpmd iofwd mode.
> 
> It is not unnecessary: it is a mandatory slowdown for any approach
> which fixes the problem, whether it's in DPDK or not.

dpdk runs on other OSes too (freebsd).
For non-linux users it definetly looks like an unnecessary one. 

> 
> > Please think up some other approach.
> > Konstantin
> 
> What characteristics are you looking for in "some other approach"?
> That DPDK is not interrupted? Impossible.

Even if it has to be interrupted - why this can't be done transparently to the user?
Via some high-priority kernel thread/interrupt or so?  

> 
> See, the thing here is that nanosleep length and frequency are
> controllable, which allows an application developer to tune the values
> _and_ still meet their performance metrics.

Honestly, I can't see how you can force each and every application developer
to start injecting nanosleep() into every busy loop inside their applications.
Not to mention already existing legacy apps.

Konstantin

> 
> I would rather be interested in a robust system (which allows certain
> maintenance tasks to execute on isolated CPUs) with deterministic
> performance meeting defined goals rather than "maximum performance"
> (which is completly meaningless other than for marketing purposes).


> 
> 
> > > > The new parameters are:
> > > >
> > > > *  --delay-hz: sets nanosleep frequency in Hz.
> > > > *  --delay-length: sets nanosleep length in ns.
> > > >
> > > > Results for delay-hz=100,delay-length=10000 (which allows
> > > > the buffer_size_kb change to complete):
> > > >
> > > > Baseline run-1:
> > > > [Histogram port 0 to port 1 at rate 2.3 Mpps] Samples: 49505, Average:
> > > > 19008.7 ns, StdDev: 2501.0 ns, Quartiles: 17293.0/18330.0/19901.0 ns
> > > >
> > > > Baseline run-2:
> > > > [Histogram port 0 to port 1 at rate 2.3 Mpps] Samples: 49606, Average:
> > > > 19036.4 ns, StdDev: 2485.2 ns, Quartiles: 17318.0/18349.0/19936.0 ns
> > > >
> > > > Baseline run-3:
> > > > [Histogram port 0 to port 1 at rate 2.3 Mpps] Samples: 49627, Average:
> > > > 19019.2 ns, StdDev: 2503.7 ns, Quartiles: 17323.0/18355.0/19940.0 ns
> > > >
> > > > ============================
> > > >
> > > > (10.000us, 100HZ)
> > > >
> > > > Run-1:
> > > > [Histogram port 0 to port 1 at rate 2.3 Mpps] Samples: 7284, Average:
> > > > 20830.6 ns, StdDev: 12023.0 ns, Quartiles: 17309.0/18394.0/20233.0 ns
> > > >
> > > > Run-2:
> > > > [Histogram port 0 to port 1 at rate 2.3 Mpps] Samples: 6272, Average:
> > > > 20897.1 ns, StdDev: 12057.2 ns, Quartiles: 17389.0/18457.0/20266.0 ns
> > > >
> > > > Run-3:
> > > > [Histogram port 0 to port 1 at rate 2.3 Mpps] Samples: 4843, Average:
> > > > 20535.2 ns, StdDev: 9827.3 ns, Quartiles: 17389.0/18441.0/20269.0 ns
> > > >
> > > >
> > > > Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> > > >
> > > >
> > > > diff -Nur dpdk-17.08.orig/app/test-pmd/iofwd.c dpdk-17.08/app/test-pmd/iofwd.c
> > > > --- dpdk-17.08.orig/app/test-pmd/iofwd.c	2017-10-30 22:45:37.829492673 -0200
> > > > +++ dpdk-17.08/app/test-pmd/iofwd.c	2017-10-30 22:45:48.321522581 -0200
> > > > @@ -64,9 +64,30 @@
> > > >  #include <rte_ethdev.h>
> > > >  #include <rte_string_fns.h>
> > > >  #include <rte_flow.h>
> > > > +#include <time.h>
> > > >
> > > >  #include "testpmd.h"
> > > >
> > > > +uint32_t nanosleep_interval;
> > > > +
> > > > +static void calc_nanosleep_interval(int hz)
> > > > +{
> > > > +	uint64_t cycles_per_sec = rte_get_timer_hz();
> > > > +	nanosleep_interval = cycles_per_sec/hz;
> > > > +}
> > > > +
> > > > +static void do_nanosleep(void)
> > > > +{
> > > > +	struct timespec req;
> > > > +
> > > > +	req.tv_sec = 0;
> > > > +	req.tv_nsec = nanosleep_length;
> > > > +
> > > > +	nanosleep(&req, NULL);
> > > > +
> > > > +	return;
> > > > +}
> > > > +
> > > >  /*
> > > >   * Forwarding of packets in I/O mode.
> > > >   * Forward packets "as-is".
> > > > @@ -81,6 +102,10 @@
> > > >  	uint16_t nb_tx;
> > > >  	uint32_t retry;
> > > >
> > > > +
> > > > +	if (nanosleep_interval == 0 && nanosleep_frequency > 0)
> > > > +		calc_nanosleep_interval(nanosleep_frequency);
> > > > +
> > > >  #ifdef RTE_TEST_PMD_RECORD_CORE_CYCLES
> > > >  	uint64_t start_tsc;
> > > >  	uint64_t end_tsc;
> > > > @@ -91,6 +116,12 @@
> > > >  	start_tsc = rte_rdtsc();
> > > >  #endif
> > > >
> > > > +	if (nanosleep_frequency > 0 &&
> > > > +	    rte_get_timer_cycles() > fs->next_nanosleep) {
> > > > +		do_nanosleep();
> > > > +		fs->next_nanosleep = rte_get_timer_cycles() + nanosleep_interval;
> > > > +	}
> > > > +
> > > >  	/*
> > > >  	 * Receive a burst of packets and forward them.
> > > >  	 */
> > > > diff -Nur dpdk-17.08.orig/app/test-pmd/parameters.c dpdk-17.08/app/test-pmd/parameters.c
> > > > --- dpdk-17.08.orig/app/test-pmd/parameters.c	2017-10-30 22:45:37.830492676 -0200
> > > > +++ dpdk-17.08/app/test-pmd/parameters.c	2017-10-30 22:46:33.708651912 -0200
> > > > @@ -216,6 +216,8 @@
> > > >  	       "disable print of designated event or all of them.\n");
> > > >  	printf("  --flow-isolate-all: "
> > > >  	       "requests flow API isolated mode on all ports at initialization time.\n");
> > > > +	printf("  --delay-hz: sets nanosleep frequency in Hz.\n");
> > > > +	printf("  --delay-length: sets nanosleep length in ns.\n");
> > > >  }
> > > >
> > > >  #ifdef RTE_LIBRTE_CMDLINE
> > > > @@ -638,7 +640,9 @@
> > > >  		{ "no-rmv-interrupt",		0, 0, 0 },
> > > >  		{ "print-event",		1, 0, 0 },
> > > >  		{ "mask-event",			1, 0, 0 },
> > > > -		{ 0, 0, 0, 0 },
> > > > +		{ "delay-hz",			1, 0, 0 },
> > > > +		{ "delay-length",		1, 0, 0 },
> > > > +	 	{ 0, 0, 0, 0 },
> > > >  	};
> > > >
> > > >  	argvopt = argv;
> > > > @@ -1099,6 +1103,27 @@
> > > >  				else
> > > >  					rte_exit(EXIT_FAILURE, "bad txpkts\n");
> > > >  			}
> > > > +
> > > > +			if (!strcmp(lgopts[opt_idx].name, "delay-hz")) {
> > > > +				int n;
> > > > +
> > > > +				n = atoi(optarg);
> > > > +
> > > > +				if (n < 0)
> > > > +					rte_exit(EXIT_FAILURE, "bad delay-hz\n");
> > > > +				nanosleep_frequency = n;
> > > > +			}
> > > > +
> > > > +			if (!strcmp(lgopts[opt_idx].name, "delay-length")) {
> > > > +				int n;
> > > > +
> > > > +				n = atoi(optarg);
> > > > +
> > > > +				if (n < 0)
> > > > +					rte_exit(EXIT_FAILURE, "bad delay-length\n");
> > > > +				nanosleep_length = n;
> > > > +			}
> > > > +
> > > >  			if (!strcmp(lgopts[opt_idx].name, "no-flush-rx"))
> > > >  				no_flush_rx = 1;
> > > >  			if (!strcmp(lgopts[opt_idx].name, "disable-link-check"))
> > > > diff -Nur dpdk-17.08.orig/app/test-pmd/testpmd.c dpdk-17.08/app/test-pmd/testpmd.c
> > > > --- dpdk-17.08.orig/app/test-pmd/testpmd.c	2017-10-30 22:45:37.829492673 -0200
> > > > +++ dpdk-17.08/app/test-pmd/testpmd.c	2017-10-30 22:45:48.323522591 -0200
> > > > @@ -327,6 +327,13 @@
> > > >
> > > >  #endif
> > > >
> > > > +
> > > > +/* How long to sleep in packet processing */
> > > > +uint32_t nanosleep_length;
> > > > +
> > > > +/* How often to sleep in packet processing */
> > > > +uint32_t nanosleep_frequency;
> > > > +
> > > >  /*
> > > >   * Ethernet device configuration.
> > > >   */
> > > > diff -Nur dpdk-17.08.orig/app/test-pmd/testpmd.h dpdk-17.08/app/test-pmd/testpmd.h
> > > > --- dpdk-17.08.orig/app/test-pmd/testpmd.h	2017-10-30 22:45:37.829492673 -0200
> > > > +++ dpdk-17.08/app/test-pmd/testpmd.h	2017-10-30 22:45:48.323522591 -0200
> > > > @@ -127,6 +127,7 @@
> > > >  	struct pkt_burst_stats rx_burst_stats;
> > > >  	struct pkt_burst_stats tx_burst_stats;
> > > >  #endif
> > > > +	uint64_t next_nanosleep;
> > > >  };
> > > >
> > > >  /** Offload IP checksum in csum forward engine */
> > > > @@ -390,6 +391,9 @@
> > > >  extern lcoreid_t latencystats_lcore_id;
> > > >  #endif
> > > >
> > > > +extern uint32_t nanosleep_length;
> > > > +extern uint32_t nanosleep_frequency;
> > > > +
> > > >  #ifdef RTE_LIBRTE_BITRATE
> > > >  extern lcoreid_t bitrate_lcore_id;
> > > >  extern uint8_t bitrate_enabled;
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > >
> > > --
> > > Adrien Mazarguil
> > > 6WIND

  reply	other threads:[~2017-11-12 23:14 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-11-10  6:02 [PATCH] testpmd: add nanosleep in main loop Marcelo Tosatti
2017-11-10  9:12 ` Adrien Mazarguil
2017-11-10 10:13   ` Daniel Bristot de Oliveira
2017-11-10 10:14   ` Ananyev, Konstantin
2017-11-10 10:42     ` Daniel Bristot de Oliveira
2017-11-10 11:14       ` Bruce Richardson
2017-11-10 13:51         ` Luiz Capitulino
2017-11-11  3:59           ` Marcelo Tosatti
2017-11-11  4:01             ` Marcelo Tosatti
2017-11-11  3:54         ` Marcelo Tosatti
2017-11-11  3:49     ` Marcelo Tosatti
2017-11-12 23:14       ` Ananyev, Konstantin [this message]
2017-11-13 18:01         ` Marcelo Tosatti
2017-11-11  3:45   ` Marcelo Tosatti

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=2601191342CEEE43887BDE71AB9772585FABC8DB@irsmsx105.ger.corp.intel.com \
    --to=konstantin.ananyev@intel.com \
    --cc=adrien.mazarguil@6wind.com \
    --cc=bristot@redhat.com \
    --cc=dev@dpdk.org \
    --cc=lcapitulino@redhat.com \
    --cc=mtosatti@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.