Re: [PATCH v3 0/3] ntb_perf: add new 'latency' test set

From: Jon Mason <jdmason@kudzu.us>
To: Serge Semin <fancer.lancer@gmail.com>
Cc: Alexander Fomichev <fomichev.ru@gmail.com>,
	Dave Jiang <dave.jiang@intel.com>,
	ntb@lists.linux.dev,  linux@yadro.com,
	Allen Hubbe <allenbh@gmail.com>,
	Guo Zhengkui <guozhengkui@vivo.com>,
	 Alexander Fomichev <a.fomichev@yadro.com>
Subject: Re: [PATCH v3 0/3] ntb_perf: add new 'latency' test set
Date: Tue, 9 Aug 2022 11:43:20 -0400	[thread overview]
Message-ID: <CAPoiz9y5_6=N0tRo2n22bFpFb2GEcbfC8L8a4J+EZjeEegTbiA@mail.gmail.com> (raw)
In-Reply-To: <20220622203648.jo6raa4h57g24el2@mobilestation>

On Wed, Jun 22, 2022 at 4:36 PM Serge Semin <fancer.lancer@gmail.com> wrote:
>
> On Wed, Jun 22, 2022 at 12:44:57PM +0300, Alexander Fomichev wrote:
> > Hi Serge,
> >
> > Thank you for the very detailed comments.
> >
> > On Tue, Jun 21, 2022 at 05:05:37PM +0300, Serge Semin wrote:
> > >
> > > Please also note, there is a special test-script, which rely on the
> > > certain test drivers semantic:
> > > tools/testing/selftests/ntb/ntb_test.sh
> > > It looks like the suggested patches don't change the already defined
> > > ntb_perf DebugFS interface, but still may cause the script to fail if the
> > > latency on the particular system will get measured too high. So should
> > > we have the latency-part in a separate driver, the script won't get
> > > affected by it. If it is required the script could be updated accordingly
> > > taking the new driver specifics into account.
> > >
>
> > As described in the cover commit message, the resulting test is fully
> > backward compatible. I mean that if we don't fiddle with the new sysfs
> > entries, then no latency measures are performed, and the test works as
> > it did before the patch set.
>
> sysfs? Did you mean debugfs?
> Anyway the DebugFS interface will be indeed backward compatible, but
> functionally the performance test won't be the same. AFAICS at the
> very least the burst-latency test is executed by default together with
> the standard performance test. It may cause ntb_test.sh regressions on
> the systems (the test will fail if the latency is too high), which
> haven't had any problem before.
>
> > Also, I plan to make changes to "ntb_test.sh" script accordingly, after
> > this patch set is merged.
>
> Good. It will be much easier to do should you have the latency test
> implemented as a separate driver.
>
> >
> > > Regarding the tests implementation. As I see it failing the latency
> > > measurements if they're performed with the too few retries isn't a good
> > > idea. Alexander, you said that normally performing 1000 retries is
> > > enough to get the latency with a good precision, but the test driver
> > > returns an error if the number of retries is less than 20. So what
> > > happens between 20 and 1000? The tests get passed, but the results
> > > aren't accurate or what? If so then why don't the test fail in the
> > > case of 30 iterations too? IMO as long as you don't define the strong
> > > accuracy criteria, the failure condition shouldn't be determined by
> > > the number of iterations. So if I were you I would execute the latency
> > > tests with the specified "lat_time_ms" duration and printed a warning
> > > if the number of iterations turned to be too low (100, 200?) most
> > > likely causing to have inaccurate results, but still would calculate
> > > the latency from the determined numbers (even if there were only one
> > > iteration performed).
> > >
> > Reasonable. I can easily change this part.
> >
> > > The main issue is that after applying all the changes the ntb_perf
> > > driver will get extended greatly with three additional sub-tests
> > > and thus will loose its coherency. It gets to be obvious after
> > > the patch 2 and 3 applied, which introduce additional client-server
> > > semantic and imply allocating their-own private data. All of that
> > > makes the code much harder to read and breaks the driver specialization.
> > >
> > > The latency tests as them-self are very useful though. But it would be
> > > much better to have them implemented in a separate driver
> > > "ntb_latency" or something.
> > >
>
> > The whole 'latency' part relies on 'ntb_perf' infrastructure.
>
> Yeah, it's very handy, isn't it? =)
> Originally it has been created in a way so the perf-test would be
> portable across all the supported NTB HW types: local- and peer-based
> MW xlate address setup (Intel/AMD/Switchtec NTB HW vs IDT NTB HW),
> Scratchpad and Messages capable devices (Intel/AMD/Switchtec NTB HW vs
> IDT NTB HW). My idea was to provide a reference of how a portable NTB
> application could be designed. (It took some hard time to debug that
> part on the Intel/AMD hardware since nobody of the current maintainers
> had an access to one at that moment.) On the next step I was going to
> move the communication part of the ntb_perf driver to a separate
> kernel module as a communication library (which could be used for the
> inter-domains basic communications on top of the DB+Spad/MSG
> interface), but alas didn't find a time to get to work on it.
>
> > Moreover,
> > the first patch adds only one meaningful function.  Thus separatin theg
> > 'latency' part will make me copy a lot of code. As a compromise, I can
> > offer to put latency-related code into a separate .c file but leave the
> > whole test in a single module. That should increase readability and
> > eliminate code duplication.
>
> What would be much better if you detached the infrastructure part into
> a separate module, which could be afterwards used by the ntb_perf and
> your ntb_latency drivers.
>
> What would be the best if you created it as a kernel library with a
> well-defined interface, which could be used not only by the test
> drivers, but by any kernel application (most importantly by the NTB
> transport driver) for the basic communications (like MW xlate address
> exchange, portable MW xlate address setup, etc).
>
> >
> > > I am very sorry to spilling it out at this stage. I should have done
> > > it on v1 or v2. Anyway it's up to the driver/subsystem maintainers
> > > (Dave, Jon) to decide whether the suggested update is suitable despite
> > > of all my thoughts.
> > >
>
> > Let's call for the third opinion.
>
> Ok.
>
> -Sergey

The merge window is closing soon. I'm going to drop the v3 that I
currently have and let this cook for a little longer.  Let's do a v4
with all of the feedback that has been agreed on, and then we can take
a look again.

Thanks,
Jon

> >
> >
> > --
> > Regards,
> >   Alexander