From: Feng Tang <feng.tang@intel.com>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: kernel test robot <yujie.liu@intel.com>,
	Frank Denis <j@pureftpd.org>, <oe-lkp@lists.linux.dev>,
	<lkp@intel.com>, <linux-kernel@vger.kernel.org>,
	<ying.huang@intel.com>, <fengwei.yin@intel.com>
Subject: Re: [linus:master] [x86] adfcf4231b: blogbench.read_score -10.9% regression
Date: Tue, 9 May 2023 20:24:47 +0800	[thread overview]
Message-ID: <ZFo7j/M1suY/Ey5M@feng-clx> (raw)
In-Reply-To: <CAHk-=wjMz_=WyfQMx1ebofeq+w6Cr_kRcJ1Xc=D6iKc4My16bQ@mail.gmail.com>

Hi Linus,

On Thu, May 04, 2023 at 11:56:12AM -0700, Linus Torvalds wrote:
> On Thu, May 4, 2023 at 10:39 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > And I suspect that the 'blogbench.read_score' thing probably does
> > mostly big reads, and really really liked the cacheline optimizations
> > of 'rep movs' even when it was slow in other things.
> 
> Oh, and I went to look at the blogbench code, and it's a
> *horrifically* bad benchmark for this case.
> 
> It may actually be a good benchmark for other things - the "look up
> random filenames" etc might be somewhat indicative of real performance
> on some loads.
> 
> But the read() system call part is very much *not* good.
> 
> It looks like the reason blogbench likes 'rep movs' is the traditional
> reason why memory copy benchmarks tend to like uncached copies: it
> looks like the benchmark doesn't actually even look at the data it
> reads.
> 
> It's a common pattern, but it's a bad pattern. It's not very different
> from doing CPU benchmarks using an empty loop:
> 
>      for (int i = 0; i < 1000000; i++) /*nothing*/;
> 
> and then being surprised when the compiler optimized the loop away
> because it doesn't actually *do* anything.
> 
> Of course, unlike the empty loop, the kernel will still do the reads,
> but the kernel will do the reads with the assumption that the result
> matters.
> 
> And for blogbench, that is *doubly* not the case.
> 
> Because it looks like blogbench does a loop over
> 
>         read_dummy_file(file_name);
> 
> which will open that file, and then loop over
> 
>         #define READ_CHUNK_SIZE 65536
>         ...
>             static char garbage[READ_CHUNK_SIZE];
>         ..
>         loop:
>                 if ((readen = read(fd, garbage, sizeof garbage))
> 
> and that variable is very appropriately named indeed. It is very much garbage.
> 
> So it really looks like what is going on is that you have
> 
>  (a) a threaded application (looks like N=100 for reads by default)
> 
>  (b) doing 'read()' calls into a *shared* buffer in parallel
> 
>  (c) and never using the buffer for anything else
> 
> which means that a memory copy that does non-temporal writes is going
> to look artificially really good, because nobody wants to ever see the
> end result.
> 
> Now, the "nobody wants to ever see the end result", might actually be
> at least _partially_ valid in some historical setting, if you were to
> use DMA to then push things out to a network. Back in the olden days,
> that was actually a valid reason to use non-temporal writes, because
> the DMA would have to invalidate the caches anyway.
> 
> So that part is wrong, but isn't necessarily *horrifically* wrong.
> It's just bad.
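
To make "non-temporal writes" concrete, here is a minimal user-space
sketch of such a copy using SSE2 intrinsics (illustrative only, not
the kernel's copy routine; it assumes 16-byte aligned buffers and a
length that is a multiple of 16):

    #include <emmintrin.h>   /* SSE2: _mm_stream_si128() etc. */
    #include <stddef.h>

    /* Streaming (movntdq) stores bypass the cache, so the destination
     * never becomes cache-hot.  That only pays off when nobody reads
     * the result afterwards -- which is exactly the blogbench case. */
    static void copy_nontemporal(void *dst, const void *src, size_t len)
    {
        __m128i *d = dst;
        const __m128i *s = src;

        for (size_t i = 0; i < len / 16; i++)
            _mm_stream_si128(&d[i], _mm_load_si128(&s[i]));
        _mm_sfence();        /* order the streaming stores */
    }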
> 
> But using one shared destination buffer for N readers is *completely*
> bogus. Any cache-lines would bounce around like crazy for the
> (pointless) sharing. I'd like to call it "false sharing", but it's not
> really false: it's intentionally just using one big shared buffer as a
> sink for all these read() calls.
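
The cache-line bouncing is easy to reproduce in a stand-alone toy
program (ours, not blogbench): N threads memcpy()ing into one shared
sink buffer versus into per-thread sinks.  On most multi-core machines
the shared case runs measurably slower:

    /* build: gcc -O2 -pthread sink_demo.c */
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    #define NTHREADS 8
    #define CHUNK    65536
    #define ITERS    20000

    static char src[CHUNK];
    static char shared_sink[CHUNK];   /* deliberately racy, like blogbench */
    static int  use_shared;

    static void *worker(void *arg)
    {
        char *mine = malloc(CHUNK);   /* per-thread destination */

        for (int i = 0; i < ITERS; i++)
            memcpy(use_shared ? shared_sink : mine, src, CHUNK);
        free(mine);
        return arg;
    }

    static double run(int shared)
    {
        pthread_t tid[NTHREADS];
        struct timespec t0, t1;

        use_shared = shared;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < NTHREADS; i++)
            pthread_create(&tid[i], NULL, worker, NULL);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(tid[i], NULL);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    }

    int main(void)
    {
        printf("shared sink:      %.3fs\n", run(1));
        printf("per-thread sinks: %.3fs\n", run(0));
        return 0;
    }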
> 
> End result: I think the benchmark scores for reading are entirely
> pointless random numbers, and trying to optimize the kernel for this
> benchmark is exactly the wrong thing to do.
> 
> The benchmark doesn't actually do anything half-way valid.
> 
> At a *minimum* that benchmark should have different threads using
> different read() buffers.

We tried a debug patch which allocates a dedicated read buffer for
each reader thread, ran it on the same Cascade Lake platform, and the
regression is _gone_ once the noise from the shared-buffer cache-line
bouncing is removed:

20f3337d350c4e1b adfcf4231b8cbc2d9c1e7bfaa96 
---------------- --------------------------- 
   2119572 ±  2%      +1.7%    2155704 ±  2%  blogbench.read_score

   3700996 ±  9%      -7.4%    3427562 ± 16%  perf-stat.ps.dTLB-load-misses
 8.171e+09           +25.3%  1.024e+10        perf-stat.ps.dTLB-loads
    431224 ±  9%     -13.1%     374903 ± 12%  perf-stat.ps.dTLB-store-misses
 2.399e+09           +87.5%  4.497e+09        perf-stat.ps.dTLB-stores
   8495507 ±  7%      -8.2%    7794738        perf-stat.ps.iTLB-load-misses

      3.07            -3.1        0.00        perf-profile.self.cycles-pp.copy_user_enhanced_fast_string
      0.00            +3.4        3.40        perf-profile.self.cycles-pp.copy_user_generic_unrolled

We also ran the original blogbench binary on an Ice Lake platform
(which has both ERMS and FSRM), and there is almost no performance
change between the two commits.

Please let us know if you need more profiling info or want us to run
more tests.

Thanks,
Feng

---
The debug patch for blogbench is:

diff --git a/src/blogbench.h b/src/blogbench.h
index bf538a8..d99d5bb 100644
--- a/src/blogbench.h
+++ b/src/blogbench.h
@@ -178,6 +178,8 @@ void *rewriter(void * const fodder);
 void *reader(void * const fodder);
 void *commenter(void * const fodder);
 
+extern pthread_key_t read_buf_key;
+
 #include "globals.h"
 
 #endif
diff --git a/src/helpers.c b/src/helpers.c
index 4c17e73..85631ec 100644
--- a/src/helpers.c
+++ b/src/helpers.c
@@ -107,15 +107,18 @@ int create_atomic_file(const char * const file_name,
 
 int read_dummy_file(const char * const file_name)
 {
-    static char garbage[READ_CHUNK_SIZE];
     ssize_t readen;
+    void * read_buf;
     int fd;
     
     if ((fd = open(file_name, O_RDONLY)) == -1) {
         return -1;
     }
+
+    read_buf = pthread_getspecific(read_buf_key);
+
     do {
-        if ((readen = read(fd, garbage, sizeof garbage))
+        if ((readen = read(fd, read_buf, READ_CHUNK_SIZE))
             < (ssize_t) 0) {
             if (errno == EINTR) {
                 continue;
diff --git a/src/process.c b/src/process.c
index a83a980..04f1411 100644
--- a/src/process.c
+++ b/src/process.c
@@ -58,9 +58,13 @@ static int wait_rewriters(void)
     return 0;    
 }
 
+pthread_key_t read_buf_key = 0;
+
 static int spawn_readers(void)
 {
     unsigned int t = nb_readers;
+
+    pthread_key_create(&read_buf_key, NULL);
     
     do {
         t--;
diff --git a/src/reader.c b/src/reader.c
index 91bab7a..2c0efdb 100644
--- a/src/reader.c
+++ b/src/reader.c
@@ -66,6 +66,11 @@ static int read_random_blog(void)
 void *reader(void * const fodder)
 {    
     (void) fodder;
+
+    void *read_buf;
+
+    read_buf = malloc(READ_CHUNK_SIZE);
+    pthread_setspecific(read_buf_key, read_buf);
     
     do {
         if (read_random_blog() != 0) {
@@ -75,6 +80,8 @@ void *reader(void * const fodder)
         usleep(USLEEP_READER);
 #endif
     } while (stop == (sig_atomic_t) 0);
+
+    free(read_buf);
     
     return NULL;
 }
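
For completeness, an even smaller (untested) variant of the same fix,
assuming a GCC/Clang toolchain where __thread is available, would be
to make the sink buffer thread-local and skip the pthread key
entirely:

    /* sketch only: one private 64 KiB sink per reader thread */
    static __thread char garbage[READ_CHUNK_SIZE];

Either way each reader writes to its own cache lines, which is what
makes the read_score comparison meaningful again.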
  
> I suspect it might be a good idea to also touch some of the data,
> because that would be the normal situation (even with old-style
> zero-copy DMA you'd probably not do raw file data and push it out to
> the network unmodified).
> 
> Because otherwise you will always find that objectively bad memory
> copies will do better than a good memory copy that makes sense.
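
A hedged sketch of what "touching the data" could look like in the
reader loop (the helper below is our invention, not blogbench code):
fold the freshly read bytes into a checksum that is kept live, so the
compiler cannot drop the pass and the copied cache lines actually get
consumed:

    #include <stddef.h>
    #include <stdint.h>

    /* Illustrative only: consume the chunk that read() just filled in.
     * The volatile sink keeps the summing loop from being optimized away. */
    static volatile uint64_t checksum_sink;

    static void consume_read_data(const char *buf, size_t len)
    {
        uint64_t sum = 0;

        for (size_t i = 0; i < len; i++)
            sum += (unsigned char) buf[i];
        checksum_sink += sum;
    }

    /* in the read loop, after a successful read() of 'readen' bytes:
     *     consume_read_data(read_buf, (size_t) readen);
     */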
> 
> Anyway, that was a very long-winded way of saying that I will not use
> that benchmark as a reason to touch the kernel "copy_to_user()"
> implementation.
> 
> Of course, that's not to say that this might not be a real regression
> on real loads, but I'd want to see those other numbers.
> 
> It looks like this is an old benchmark that hasn't been touched in
> years (just going by the github activity), but I'm cc'ing the author
> anyway, and will add a pointer to this email as a github issue. Maybe
> the developer cares, maybe he doesn't, but no reason not to at least
> mention this and see if maybe the benchmark can be improved to at
> least use per-thread buffers for the reads.
> 
> (The same issues are true for the writing side too, of course, but the
> writing side has less parallelism and a shared *source* is not the
> same kind of problematic from a cache pattern standpoint, so probably
> isn't going to have anything like the same effect from any memcpy
> implementation).
> 
>                   Linus

