Re: [Qemu-devel] [PATCH v3 11/11] translate-all: add tb hash bucket info to 'info jit' dump

From: "Emilio G. Cota" <cota@braap.org>
To: Richard Henderson <rth@twiddle.net>
Cc: "QEMU Developers" <qemu-devel@nongnu.org>,
	"MTTCG Devel" <mttcg@greensocs.com>,
	"Alex Bennée" <alex.bennee@linaro.org>,
	"Paolo Bonzini" <pbonzini@redhat.com>,
	"Peter Crosthwaite" <crosthwaite.peter@gmail.com>,
	"Peter Maydell" <peter.maydell@linaro.org>,
	"Sergey Fedorov" <serge.fdrv@gmail.com>
Subject: Re: [Qemu-devel] [PATCH v3 11/11] translate-all: add tb hash bucket info to 'info jit' dump
Date: Fri, 22 Apr 2016 19:57:38 -0400
Message-ID: <20160422235738.GA2410@flamenco>
In-Reply-To: <571A82B8.5080908@twiddle.net>

On Fri, Apr 22, 2016 at 12:59:52 -0700, Richard Henderson wrote:
> FWIW, so that I could get an idea of how the stats change as we improve the
> hashing, I inserted the attachment 1 patch between patches 5 and 6, and with
> attachment 2 attempting to fix the accounting for patches 9 and 10.

For qht, I dislike the approach of reporting "avg chain" per-element
instead of per-bucket. A lookup in a bucket whose entries are all
valid performs virtually the same as a lookup in a bucket that has
only one valid entry; thus, with per-bucket reporting, we'd say that
the chain length is 1 in both cases, i.e. "perfect". With per-element
reporting, we'd report 4 when the bucket is full (on a 64-bit host,
since that's the value of QHT_BUCKET_ENTRIES), which IMO gives the
wrong idea (users would think they're in trouble, when they're not).
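
To make the difference between the two metrics concrete, here is a
minimal sketch, not the code in the patch. All the sketch_* names are
made up for illustration, and I'm assuming that empty head buckets are
left out of both averages.

```c
#include <stddef.h>

/* Assumed for this sketch: 4 entries per bucket, as on a 64-bit host. */
#define SKETCH_BUCKET_ENTRIES 4

/* Toy bucket: only the number of valid entries and the overflow link. */
struct sketch_bucket {
    int used;                         /* valid entries, 0..SKETCH_BUCKET_ENTRIES */
    const struct sketch_bucket *next; /* overflow bucket, or NULL */
};

struct sketch_stats {
    double per_bucket; /* avg chained buckets per non-empty head */
    double per_entry;  /* avg valid entries per non-empty head */
};

/* Walk every head bucket and its chain, then compute both averages. */
static struct sketch_stats
sketch_chain_stats(const struct sketch_bucket *heads, size_t n)
{
    size_t used_heads = 0, buckets = 0, entries = 0;
    struct sketch_stats s = { 0.0, 0.0 };

    for (size_t i = 0; i < n; i++) {
        size_t len = 0, valid = 0;

        for (const struct sketch_bucket *b = &heads[i]; b; b = b->next) {
            len++;
            valid += (size_t)b->used;
        }
        if (valid) { /* heads with no valid entries are not counted */
            used_heads++;
            buckets += len;
            entries += valid;
        }
    }
    if (used_heads) {
        s.per_bucket = (double)buckets / used_heads;
        s.per_entry = (double)entries / used_heads;
    }
    return s;
}
```

For one full head bucket, one head bucket with a single valid entry and
one empty head bucket, per_bucket is 1 while per_entry is 2.5, even
though a lookup touches a single bucket in both non-empty cases.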

Using the avg-bucket-chain metric you can test how good the hashing is.
For instance, the metric is 1.01 for xxhash with phys_pc, pc and flags
(i.e. func5), and 1.21 if func5 takes only a valid phys_pc (the other two are 0).

I think reporting fully empty buckets as well as the longest chain
(of buckets for qht) in addition to this metric is a good idea, though.

> For booting an alpha kernel to login prompt:
> 
> Before hashing changes (@5/11)
> 
> TB count             175363/671088
> TB invalidate count  3996
> TB hash buckets      31731/32768
> TB hash avg chain    5.289 max=59
> 
> After xxhash patch (@7/11)
> 
> TB hash buckets      32582/32768
> TB hash avg chain    5.260 max=18
> 
> So far so good!
> 
> After qht patches (@11/11)
> 
> TB hash buckets      94360/131072
> TB hash avg chain    1.774 max=8
> 
> Do note that those last numbers are off: 1.774 avg * 94360 used buckets =
> 167394 total entries, which is far from 171367, the correct number of total
> entries.

If those numbers are off, then either this
    assert(hinfo.used_entries ==
           tcg_ctx.tb_ctx.nb_tbs - tcg_ctx.tb_ctx.tb_phys_invalidate_count);
should trigger, or the accounting isn't right.

Another option is that the "TB count - invalidate_count" is different
for each test you ran. I think this is what's going on, otherwise we couldn't
explain why the first report ("before 5/11") is also "wrong":

  5.289*31731=167825.259

Only the second report ("after 7/11") seems good (taking into account
lack of precision of just 3 decimals):
  5.26*32582=171381.32 ~= 171367
which leads me to believe that you've used the TB and invalidate
counts from that test.

I just tested your patches (on an ARM bootup) and the assert doesn't trigger,
and the stats are spot on for "after 11/11":

TB count            643610/2684354
TB hash buckets     369534/524288
TB hash avg chain   1.729 max=8
TB flush count      0
TB invalidate count 4718

1.729*369534=638924.286, which is ~= 643610-4718 = 638892.
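
The sanity check I'm doing by hand above can be sketched as follows.
This is only an illustration: report_consistent() is a made-up helper,
and the slack assumes the average is printed rounded to 3 decimals, so
it can be off by up to 0.0005 per used bucket.

```c
#include <stdbool.h>

/*
 * Return true if "avg chain * used buckets" matches
 * "TB count - invalidate count" within the rounding error
 * of an average printed with 3 decimals.
 */
static bool report_consistent(double avg_chain, long used_buckets,
                              long tb_count, long invalidate_count)
{
    double implied = avg_chain * (double)used_buckets;
    double expected = (double)(tb_count - invalidate_count);
    double slack = 0.0005 * (double)used_buckets;
    double diff = implied - expected;

    if (diff < 0) {
        diff = -diff;
    }
    return diff <= slack;
}
```

With the numbers in this thread, report_consistent(1.729, 369534,
643610, 4718) and report_consistent(5.260, 32582, 175363, 3996) hold,
whereas report_consistent(5.289, 31731, 175363, 3996) does not.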

> I'm tempted to pull over gcc's non-chaining hash table implementation
> (libiberty/hashtab.c, still gplv2+) and compare...

You can try, but I think performance wouldn't be great, because
the comparison function would be called way too often due to the
ht using open addressing. The problem there is not only the comparisons
themselves, but also all the cache lines that must be loaded to read
the fields being compared. I haven't tested libiberty's htable but I
did test the htable in concurrencykit[1], which also uses open addressing.

With ck's ht, performance was not good when booting ARM: IIRC ~30% of
runtime was spent on tb_cmp(); I also added the full hash to each TB so
that it would be compared first, but it didn't make a difference since
the delay was due to loading the cache line (I saw this with perf(1)'s
annotated code, which showed that ~80% of the time spent in tb_cmp()
was in performing the first load of the TB's fields).

This led me to a design that had buckets with a small set of
hash & pointer pairs, all in the same cache line as the head (then
I discovered somebody else had thought of this, and that's why there's
a link to the CLHT paper in qht.c).
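
A rough sketch of that layout, to show why it avoids the extra loads.
This is not the struct in qht.c: the sketch_* names, the 64-byte line
size and the 4 entries per bucket are assumptions for illustration, and
the locking needed for concurrent use is left out entirely.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_CACHELINE 64 /* assumed cache line size */
#define SKETCH_ENTRIES   4  /* assumed entries per bucket */

/* Hashes, pointers and the chain link all share one cache line. */
struct sketch_bucket {
    _Alignas(SKETCH_CACHELINE) uint32_t hashes[SKETCH_ENTRIES];
    void *pointers[SKETCH_ENTRIES];
    struct sketch_bucket *next;
};

_Static_assert(sizeof(struct sketch_bucket) == SKETCH_CACHELINE,
               "bucket must fit exactly in one cache line");

typedef int (*sketch_cmp_fn)(const void *candidate, const void *key);

/*
 * The stored hashes are checked first, so cmp() (and the load of the
 * candidate's own cache line) happens only when the full hash matches.
 */
static void *sketch_lookup(const struct sketch_bucket *head, uint32_t hash,
                           sketch_cmp_fn cmp, const void *key)
{
    for (const struct sketch_bucket *b = head; b; b = b->next) {
        for (size_t i = 0; i < SKETCH_ENTRIES; i++) {
            if (b->pointers[i] && b->hashes[i] == hash &&
                cmp(b->pointers[i], key)) {
                return b->pointers[i];
            }
        }
    }
    return NULL;
}

/* Example comparator that counts how often it gets called. */
static unsigned sketch_cmp_calls;

static int sketch_int_eq(const void *candidate, const void *key)
{
    sketch_cmp_calls++;
    return *(const int *)candidate == *(const int *)key;
}
```

Looking up a hash that is stored in a bucket calls the comparator only
for the matching slot, and looking up a hash that is not stored never
calls it, so no candidate is dereferenced at all in that case.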

BTW I tested ck's htable also because of a requirement we have for MTTCG,
which is to support lock-free concurrent lookups. AFAICT libiberty's ht
doesn't support this, so it might be a bit faster than ck's.

Thanks,

		Emilio

[1] http://concurrencykit.org/
    More info on their htable implementation here:
    http://backtrace.io/blog/blog/2015/03/13/workload-specialization/