* [Qemu-devel] [PATCH v2 0/3] tcg: allocate TB structs preceding translated code
@ 2017-06-05 22:49 Emilio G. Cota
  2017-06-05 22:49 ` [Qemu-devel] [PATCH v2 1/3] compiler: define QEMU_CACHELINE_SIZE Emilio G. Cota
                   ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Emilio G. Cota @ 2017-06-05 22:49 UTC
To: qemu-devel
Cc: Richard Henderson, alex.bennee, Peter Maydell, Paolo Bonzini, Pranith Kumar

v1: https://lists.gnu.org/archive/html/qemu-devel/2017-06/msg00780.html

Changes from v1:

- Define QEMU_CACHELINE_SIZE as suggested by Richard. We try to get the
  value from the machine running configure, but if we fail we use some
  reasonable defaults. In any case the value can be overridden from
  --extra-cflags at configure time, which is particularly useful when
  cross-compiling.

- Use QEMU_CACHELINE_SIZE where appropriate, namely in tests/.

- In the unlikely case that code_gen_buffer_size / avg block / 8 is 0,
  then set tbs_size to 64K instead of just 1K, as suggested by Richard.

This patchset applies on top of rth's tcg-next branch (pull-tcg-20170605 tag).

NB. Apologies if some emails sent to me bounced during the last couple of
days; my domain name (braap.org) was down.

Thanks,

		Emilio

* [Qemu-devel] [PATCH v2 1/3] compiler: define QEMU_CACHELINE_SIZE
From: Emilio G. Cota @ 2017-06-05 22:49 UTC
To: qemu-devel
Cc: Richard Henderson, alex.bennee, Peter Maydell, Paolo Bonzini, Pranith Kumar

This is a constant used as a hint for padding structs to hopefully avoid
false cache line sharing.

The constant can be set at configure time by defining QEMU_CACHELINE_SIZE
via --extra-cflags. If not set there, we try to obtain the value from
the machine running the configure script. If we fail, we default to
reasonable values, i.e. 128 bytes for ppc64 and 64 bytes for all others.

Note: the configure script only picks up the cache line size when run
on Linux hosts because I have no other platforms (e.g. Windows, BSD's)
to test on.

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 configure               | 38 ++++++++++++++++++++++++++++++++++++++
 include/qemu/compiler.h | 17 +++++++++++++++++
 2 files changed, 55 insertions(+)

diff --git a/configure b/configure
index 13e040d..6a68cb2 100755
--- a/configure
+++ b/configure
@@ -4832,6 +4832,41 @@ EOF
   fi
 fi
 
+# Find out the size of a cache line on the host
+# TODO: support more platforms
+cat > $TMPC<<EOF
+#ifdef __linux__
+
+#include <stdio.h>
+
+#define SYSFS "/sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size"
+
+int main(int argc, char *argv[])
+{
+    unsigned int size;
+    FILE *fp;
+
+    fp = fopen(SYSFS, "r");
+    if (fp == NULL) {
+        return -1;
+    }
+    if (!fscanf(fp, "%u", &size)) {
+        return -1;
+    }
+    return size;
+}
+#else
+#error Cannot find host cache line size
+#endif
+EOF
+
+host_cacheline_size=0
+if compile_prog "" "" ; then
+    ./$TMPE
+    host_cacheline_size=$?
+fi
+
+
 ##########################################
 # check for _Static_assert()
 
@@ -5284,6 +5319,9 @@ fi
 if test "$bigendian" = "yes" ; then
   echo "HOST_WORDS_BIGENDIAN=y" >> $config_host_mak
 fi
+if test "$host_cacheline_size" -gt 0 ; then
+    echo "HOST_CACHELINE_SIZE=$host_cacheline_size" >> $config_host_mak
+fi
 if test "$mingw32" = "yes" ; then
   echo "CONFIG_WIN32=y" >> $config_host_mak
   rc_version=$(cat $source_path/VERSION)
diff --git a/include/qemu/compiler.h b/include/qemu/compiler.h
index 340e5fd..178d831 100644
--- a/include/qemu/compiler.h
+++ b/include/qemu/compiler.h
@@ -40,6 +40,23 @@
 # define QEMU_PACKED __attribute__((packed))
 #endif
 
+/*
+ * Cache line size of the host. Can be overridden.
+ * Note that this is just a compile-time hint to hopefully avoid false sharing
+ * of cache lines; code must be correct regardless of the constant's value.
+ */
+#ifndef QEMU_CACHELINE_SIZE
+# ifdef HOST_CACHELINE_SIZE
+#  define QEMU_CACHELINE_SIZE HOST_CACHELINE_SIZE
+# else
+#  if defined(__powerpc64__)
+#   define QEMU_CACHELINE_SIZE 128
+#  else
+#   define QEMU_CACHELINE_SIZE 64
+#  endif
+# endif
+#endif
+
 #define QEMU_ALIGNED(X) __attribute__((aligned(X)))
 
 #ifndef glue
-- 
2.7.4

* Re: [Qemu-devel] [PATCH v2 1/3] compiler: define QEMU_CACHELINE_SIZE 2017-06-05 22:49 ` [Qemu-devel] [PATCH v2 1/3] compiler: define QEMU_CACHELINE_SIZE Emilio G. Cota @ 2017-06-06 5:39 ` Pranith Kumar 2017-06-06 8:18 ` Richard Henderson 2017-06-06 16:11 ` Emilio G. Cota 0 siblings, 2 replies; 18+ messages in thread From: Pranith Kumar @ 2017-06-06 5:39 UTC (permalink / raw) To: Emilio G. Cota Cc: qemu-devel, Richard Henderson, Alex Bennée, Peter Maydell, Paolo Bonzini On Mon, Jun 5, 2017 at 6:49 PM, Emilio G. Cota <cota@braap.org> wrote: > This is a constant used as a hint for padding structs to hopefully avoid > false cache line sharing. > > The constant can be set at configure time by defining QEMU_CACHELINE_SIZE > via --extra-cflags. If not set there, we try to obtain the value from > the machine running the configure script. If we fail, we default to > reasonable values, i.e. 128 bytes for ppc64 and 64 bytes for all others. > > Note: the configure script only picks up the cache line size when run > on Linux hosts because I have no other platforms (e.g. Windows, BSD's) > to test on. > > Signed-off-by: Emilio G. 
Cota <cota@braap.org> > --- > configure | 38 ++++++++++++++++++++++++++++++++++++++ > include/qemu/compiler.h | 17 +++++++++++++++++ > 2 files changed, 55 insertions(+) > > diff --git a/configure b/configure > index 13e040d..6a68cb2 100755 > --- a/configure > +++ b/configure > @@ -4832,6 +4832,41 @@ EOF > fi > fi > > +# Find out the size of a cache line on the host > +# TODO: support more platforms > +cat > $TMPC<<EOF > +#ifdef __linux__ > + > +#include <stdio.h> > + > +#define SYSFS "/sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size" > + > +int main(int argc, char *argv[]) > +{ > + unsigned int size; > + FILE *fp; > + > + fp = fopen(SYSFS, "r"); > + if (fp == NULL) { > + return -1; > + } > + if (!fscanf(fp, "%u", &size)) { > + return -1; > + } > + return size; > +} > +#else > +#error Cannot find host cache line size > +#endif > +EOF Is there any reason not to use sysconf(_SC_LEVEL1_DCACHE_LINESIZE)? Thanks, -- Pranith ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [Qemu-devel] [PATCH v2 1/3] compiler: define QEMU_CACHELINE_SIZE
From: Richard Henderson @ 2017-06-06 8:18 UTC
To: Pranith Kumar, Emilio G. Cota
Cc: qemu-devel, Alex Bennée, Peter Maydell, Paolo Bonzini

On 06/05/2017 10:39 PM, Pranith Kumar wrote:
> Is there any reason not to use sysconf(_SC_LEVEL1_DCACHE_LINESIZE)?

That's an excellent idea. In fact... see reply to 3/3.

r~

* Re: [Qemu-devel] [PATCH v2 1/3] compiler: define QEMU_CACHELINE_SIZE
From: Emilio G. Cota @ 2017-06-06 16:11 UTC
To: Pranith Kumar
Cc: qemu-devel, Richard Henderson, Alex Bennée, Peter Maydell, Paolo Bonzini

On Tue, Jun 06, 2017 at 01:39:45 -0400, Pranith Kumar wrote:
> On Mon, Jun 5, 2017 at 6:49 PM, Emilio G. Cota <cota@braap.org> wrote:
> > This is a constant used as a hint for padding structs to hopefully avoid
> > false cache line sharing.
> >
> > The constant can be set at configure time by defining QEMU_CACHELINE_SIZE
> > via --extra-cflags. If not set there, we try to obtain the value from
> > the machine running the configure script. If we fail, we default to
> > reasonable values, i.e. 128 bytes for ppc64 and 64 bytes for all others.
(snip)
> Is there any reason not to use sysconf(_SC_LEVEL1_DCACHE_LINESIZE)?

I tried using sysconf, but it doesn't work on the PowerPC machine I have
access to (it returns 0). It might be a machine-specific thing though--I
don't know. Here's the machine's `uname -a':
Linux gcc2-power8.osuosl.org 3.10.0-514.10.2.el7.ppc64le #1 SMP Fri Mar \
  3 16:16:38 GMT 2017 ppc64le ppc64le ppc64le GNU/Linux

		E.

* Re: [Qemu-devel] [PATCH v2 1/3] compiler: define QEMU_CACHELINE_SIZE 2017-06-06 16:11 ` Emilio G. Cota @ 2017-06-06 17:39 ` Richard Henderson 2017-06-06 20:28 ` Geert Martin Ijewski 0 siblings, 1 reply; 18+ messages in thread From: Richard Henderson @ 2017-06-06 17:39 UTC (permalink / raw) To: Emilio G. Cota, Pranith Kumar Cc: qemu-devel, Alex Bennée, Peter Maydell, Paolo Bonzini On 06/06/2017 09:11 AM, Emilio G. Cota wrote: > On Tue, Jun 06, 2017 at 01:39:45 -0400, Pranith Kumar wrote: >> On Mon, Jun 5, 2017 at 6:49 PM, Emilio G. Cota <cota@braap.org> wrote: >>> This is a constant used as a hint for padding structs to hopefully avoid >>> false cache line sharing. >>> >>> The constant can be set at configure time by defining QEMU_CACHELINE_SIZE >>> via --extra-cflags. If not set there, we try to obtain the value from >>> the machine running the configure script. If we fail, we default to >>> reasonable values, i.e. 128 bytes for ppc64 and 64 bytes for all others. > (snip) >> Is there any reason not to use sysconf(_SC_LEVEL1_DCACHE_LINESIZE)? > > I tried using sysconf, but it doesn't work on the PowerPC machine I have > access to (it returns 0). It might be a machine-specific thing though-I > don't know. Here's the machine's `uname -a': > Linux gcc2-power8.osuosl.org 3.10.0-514.10.2.el7.ppc64le #1 SMP Fri Mar \ > 3 16:16:38 GMT 2017 ppc64le ppc64le ppc64le GNU/Linux Well that's unfortunate. Doing some digging, the kernel has provided the info to userland via elf auxv data since the beginning of time (aka initial git repository build), but glibc still does not export that information properly for ppc. For ppc, you can get what we want from qemu_getauxval(AT_ICACHEBSIZE). Indeed, we already have 4 different system dependent methods for determining the icache size in tcg/ppc/tcg-target.inc.c. 
So what I think we ought to do is create a new util/cachesize.c like so:

unsigned qemu_icache_linesize = 64;
unsigned qemu_dcache_linesize = 64;

static void init_icache_data(void)
{
#ifdef _SC_LEVEL1_ICACHE_LINESIZE
    {
        long x = sysconf(_SC_LEVEL1_ICACHE_LINESIZE);
        if (x > 0) {
            qemu_icache_linesize = x;
            return;
        }
    }
#endif
#ifdef AT_ICACHEBSIZE
    {
        unsigned long x = qemu_getauxval(AT_ICACHEBSIZE);
        if (x > 0) {
            qemu_icache_linesize = x;
            return;
        }
    }
#endif
    // Other system specific methods.
}

static void init_dcache_data(void)
{
    // Similarly.
}

static void __attribute__((constructor)) init_cache_data(void)
{
    init_icache_data();
    init_dcache_data();
}

In particular, I think you want to be padding to the icache linesize rather
than the dcache linesize since what we're attempting is to avoid writable data
in the icache.

r~

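The runtime query proposed above can be tried outside QEMU with a small standalone sketch. It assumes a POSIX host; the helper names host_dcache_linesize() and pad_to_line() are hypothetical and not part of QEMU, plain getauxval() from <sys/auxv.h> stands in for qemu_getauxval(), and the 64-byte fallback mirrors the default in the proposal. It queries the D-cache line size; the I-cache variant would use _SC_LEVEL1_ICACHE_LINESIZE and AT_ICACHEBSIZE in the same way.

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>
#ifdef __linux__
#include <sys/auxv.h>   /* getauxval(); AT_DCACHEBSIZE where the libc provides it */
#endif

/* Hypothetical helper (not QEMU code): best-effort D-cache line size. */
static size_t host_dcache_linesize(void)
{
#ifdef _SC_LEVEL1_DCACHE_LINESIZE
    {
        /* May return 0 or -1 on some hosts, e.g. the ppc64le box in this thread. */
        long x = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
        if (x > 0) {
            return (size_t)x;
        }
    }
#endif
#if defined(__linux__) && defined(AT_DCACHEBSIZE)
    {
        /* The kernel exports the value in the ELF auxiliary vector. */
        unsigned long x = getauxval(AT_DCACHEBSIZE);
        if (x > 0) {
            return (size_t)x;
        }
    }
#endif
    return 64;  /* assumed default when nothing else works */
}

/* Hypothetical helper: round n up to a multiple of the detected line size. */
static size_t pad_to_line(size_t n)
{
    size_t line = host_dcache_linesize();

    assert(line > 0);  /* guaranteed by the fallback above */
    return (n + line - 1) / line * line;
}
```

Because the value is only known at run time, it can size padding and allocations but cannot appear in a struct alignment attribute, which needs a compile-time constant.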
* Re: [Qemu-devel] [PATCH v2 1/3] compiler: define QEMU_CACHELINE_SIZE 2017-06-06 17:39 ` Richard Henderson @ 2017-06-06 20:28 ` Geert Martin Ijewski 2017-06-06 21:38 ` Emilio G. Cota 0 siblings, 1 reply; 18+ messages in thread From: Geert Martin Ijewski @ 2017-06-06 20:28 UTC (permalink / raw) To: Richard Henderson, Emilio G. Cota, Pranith Kumar Cc: Peter Maydell, Alex Bennée, qemu-devel, Paolo Bonzini Am 06.06.2017 um 19:39 schrieb Richard Henderson: > On 06/06/2017 09:11 AM, Emilio G. Cota wrote: >> On Tue, Jun 06, 2017 at 01:39:45 -0400, Pranith Kumar wrote: >>> On Mon, Jun 5, 2017 at 6:49 PM, Emilio G. Cota <cota@braap.org> wrote: >>>> This is a constant used as a hint for padding structs to hopefully >>>> avoid >>>> false cache line sharing. >>>> >>>> The constant can be set at configure time by defining >>>> QEMU_CACHELINE_SIZE >>>> via --extra-cflags. If not set there, we try to obtain the value from >>>> the machine running the configure script. If we fail, we default to >>>> reasonable values, i.e. 128 bytes for ppc64 and 64 bytes for all >>>> others. >> (snip) >>> Is there any reason not to use sysconf(_SC_LEVEL1_DCACHE_LINESIZE)? >> >> I tried using sysconf, but it doesn't work on the PowerPC machine I have >> access to (it returns 0). It might be a machine-specific thing though-I >> don't know. Here's the machine's `uname -a': >> Linux gcc2-power8.osuosl.org 3.10.0-514.10.2.el7.ppc64le #1 SMP Fri >> Mar \ >> 3 16:16:38 GMT 2017 ppc64le ppc64le ppc64le GNU/Linux > > Well that's unfortunate. > > Doing some digging, the kernel has provided the info to userland via elf > auxv data since the beginning of time (aka initial git repository > build), but glibc still does not export that information properly for ppc. > > For ppc, you can get what we want from qemu_getauxval(AT_ICACHEBSIZE). > Indeed, we already have 4 different system dependent methods for > determining the icache size in tcg/ppc/tcg-target.inc.c. 
>
> So what I think we ought to do is create a new util/cachesize.c like so:
>
> unsigned qemu_icache_linesize = 64;
> unsigned qemu_dcache_linesize = 64;
>
> static void init_icache_data(void)
> {
> #ifdef _SC_LEVEL1_ICACHE_LINESIZE
>     {
>         long x = sysconf(_SC_LEVEL1_ICACHE_LINESIZE);
>         if (x > 0) {
>             qemu_icache_linesize = x;
>             return;
>         }
>     }
> #endif
> #ifdef AT_ICACHEBSIZE
>     {
>         unsigned long x = qemu_getauxval(AT_ICACHEBSIZE);
>         if (x > 0) {
>             qemu_icache_linesize = x;
>             return;
>         }
>     }
> #endif
>     // Other system specific methods.

On a fully patched Windows 10 with an i5-4690 this code works for me (TM):

#ifdef _WIN32
    {
        DWORD bufferSize = 0;
        if (!GetLogicalProcessorInformation(0, &bufferSize) &&
            GetLastError() == ERROR_INSUFFICIENT_BUFFER) {
            PSYSTEM_LOGICAL_PROCESSOR_INFORMATION buffer =
                (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION)g_malloc0(bufferSize);
            if (GetLogicalProcessorInformation(buffer, &bufferSize)) {
                size_t i = 0, numOfProcessors = bufferSize /
                    sizeof(SYSTEM_LOGICAL_PROCESSOR_INFORMATION);
                for (; i < numOfProcessors; i++) {
                    if (buffer[i].Relationship == RelationCache &&
                        buffer[i].Cache.Level == 1 &&
                        (buffer[i].Cache.Type == CacheUnified ||
                         buffer[i].Cache.Type == CacheInstruction)) {
                        qemu_icache_linesize = buffer[i].Cache.LineSize;
                        break;
                    }
                }
            }
            g_free(buffer);
        }
    }
#endif

I don't particularly like that staircase-of-ifs style, so I guess if I were
to do a proper patch this should become a function.

> }
>
> static void init_dcache_data(void)
> {
>     // Similarly.

The code from above, just s/CacheInstruction/CacheData/ and
s/qemu_icache/qemu_dcache/

> }
>
> static void __attribute__((constructor)) init_cache_data(void)
> {
>     init_icache_data();
>     init_dcache_data();
> }
>
> In particular, I think you want to be padding to the icache linesize
> rather than the dcache linesize since what we're attempting is to avoid
> writable data in the icache.
>
>
> r~
>
>

To quote from the documentation:
"RelationCache: [... snip ...]
Windows Server 2003: This value is not supported until Windows Server 2003
with SP1 and Windows XP Professional x64 Edition."
-- https://msdn.microsoft.com/en-us/library/windows/desktop/ms686694(v=vs.85).aspx

I'm not sure if that is considered a problem, as neither system has been
supported for almost 2 years now.

Geert

* Re: [Qemu-devel] [PATCH v2 1/3] compiler: define QEMU_CACHELINE_SIZE
From: Emilio G. Cota @ 2017-06-06 21:38 UTC
To: Geert Martin Ijewski
Cc: Richard Henderson, Pranith Kumar, Peter Maydell, Alex Bennée, qemu-devel, Paolo Bonzini

On Tue, Jun 06, 2017 at 22:28:23 +0200, Geert Martin Ijewski wrote:
> On a fully patched Windows 10 with an i5-4690 this code works for me (TM):

Thanks!
Can you please test this?

		Emilio
---
#include "qemu/osdep.h"
#include <windows.h>

static unsigned int linesize_win(PROCESSOR_CACHE_TYPE type)
{
    PSYSTEM_LOGICAL_PROCESSOR_INFORMATION buf;
    DWORD size = 0;
    unsigned int ret = 0;
    BOOL success;
    size_t n;
    size_t i;

    /* The first call is expected to fail and report the needed size. */
    success = GetLogicalProcessorInformation(0, &size);
    if (success || GetLastError() != ERROR_INSUFFICIENT_BUFFER) {
        return 0;
    }
    buf = (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION)g_malloc0(size);
    if (!GetLogicalProcessorInformation(buf, &size)) {
        goto out;
    }

    n = size / sizeof(SYSTEM_LOGICAL_PROCESSOR_INFORMATION);
    for (i = 0; i < n; i++) {
        if (buf[i].Relationship == RelationCache &&
            buf[i].Cache.Level == 1 &&
            (buf[i].Cache.Type == CacheUnified ||
             buf[i].Cache.Type == type)) {
            ret = buf[i].Cache.LineSize;
            break;
        }
    }
 out:
    g_free(buf);
    return ret;
}

    linesize_win(CacheInstruction);
    linesize_win(CacheData);

* Re: [Qemu-devel] [PATCH v2 1/3] compiler: define QEMU_CACHELINE_SIZE
From: Geert Martin Ijewski @ 2017-06-06 22:01 UTC
To: Emilio G. Cota
Cc: Richard Henderson, Pranith Kumar, Peter Maydell, Alex Bennée, qemu-devel, Paolo Bonzini

Am 06.06.2017 um 23:38 schrieb Emilio G. Cota:
> On Tue, Jun 06, 2017 at 22:28:23 +0200, Geert Martin Ijewski wrote:
>> On a fully patched Windows 10 with an i5-4690 this code works for me (TM):
>
> Thanks!
> Can you please test this?
>
> 		Emilio
> ---
> #include "qemu/osdep.h"
> #include <windows.h>

unnecessary as it's already included by qemu/osdep.h -> sysemu/os-win32.h

>
> static unsigned int linesize_win(PROCESSOR_CACHE_TYPE type)
> {
>     PSYSTEM_LOGICAL_PROCESSOR_INFORMATION buf;
>     DWORD size = 0;
>     unsigned int ret = 0;
>     BOOL success;
>     size_t n;
>     size_t i;
>
>     success = GetLogicalProcessorInformation(0, &size);
>     if (success || GetLastError() != ERROR_INSUFFICIENT_BUFFER) {
>         return 0;
>     }
>     buf = (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION)g_malloc0(size);
>     if (!GetLogicalProcessorInformation(buf, &size)) {
>         goto out;
>     }
>
>     n = size / sizeof(SYSTEM_LOGICAL_PROCESSOR_INFORMATION);
>     for (i = 0; i < n; i++) {
>         if (buf[i].Relationship == RelationCache &&
>             buf[i].Cache.Level == 1 &&
>             (buf[i].Cache.Type == CacheUnified ||
>              buf[i].Cache.Type == type)) {
>             ret = buf[i].Cache.LineSize;
>             break;
>         }
>     }
>  out:
>     g_free(buf);
>     return ret;
> }
>
>     linesize_win(CacheInstruction);
>     linesize_win(CacheData);
>
>

Yes, that works.

Tested-by: Geert Martin Ijewski <gm.ijewski@web.de>

* [Qemu-devel] [PATCH v2 2/3] tests: use QEMU_CACHELINE_SIZE instead of hard-coding it 2017-06-05 22:49 [Qemu-devel] [PATCH v2 0/3] tcg: allocate TB structs preceding translated code Emilio G. Cota 2017-06-05 22:49 ` [Qemu-devel] [PATCH v2 1/3] compiler: define QEMU_CACHELINE_SIZE Emilio G. Cota @ 2017-06-05 22:49 ` Emilio G. Cota 2017-06-06 5:40 ` Pranith Kumar 2017-06-05 22:49 ` [Qemu-devel] [PATCH v2 3/3] tcg: allocate TB structs before the corresponding translated code Emilio G. Cota 2 siblings, 1 reply; 18+ messages in thread From: Emilio G. Cota @ 2017-06-05 22:49 UTC (permalink / raw) To: qemu-devel Cc: Richard Henderson, alex.bennee, Peter Maydell, Paolo Bonzini, Pranith Kumar Signed-off-by: Emilio G. Cota <cota@braap.org> --- tests/atomic_add-bench.c | 4 ++-- tests/qht-bench.c | 2 +- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/tests/atomic_add-bench.c b/tests/atomic_add-bench.c index caa1e8e..c219109 100644 --- a/tests/atomic_add-bench.c +++ b/tests/atomic_add-bench.c @@ -5,11 +5,11 @@ struct thread_info { uint64_t r; -} QEMU_ALIGNED(64); +} QEMU_ALIGNED(QEMU_CACHELINE_SIZE); struct count { unsigned long val; -} QEMU_ALIGNED(64); +} QEMU_ALIGNED(QEMU_CACHELINE_SIZE); static QemuThread *threads; static struct thread_info *th_info; diff --git a/tests/qht-bench.c b/tests/qht-bench.c index 2afa09d..3f4b5eb 100644 --- a/tests/qht-bench.c +++ b/tests/qht-bench.c @@ -28,7 +28,7 @@ struct thread_info { uint64_t r; bool write_op; /* writes alternate between insertions and removals */ bool resize_down; -} QEMU_ALIGNED(64); /* avoid false sharing among threads */ +} QEMU_ALIGNED(QEMU_CACHELINE_SIZE); /* avoid false sharing among threads */ static struct qht ht; static QemuThread *rw_threads; -- 2.7.4 ^ permalink raw reply related [flat|nested] 18+ messages in thread
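For readers unfamiliar with what the alignment attribute in this patch buys, here is a minimal sketch, assuming GCC or Clang and a 64-byte line. CACHELINE_SIZE, ALIGNED, struct padded_count and same_cache_line() are illustrative stand-ins for QEMU_CACHELINE_SIZE, QEMU_ALIGNED and the benchmark structs, not QEMU code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for QEMU_CACHELINE_SIZE / QEMU_ALIGNED; 64 is an assumed default. */
#ifndef CACHELINE_SIZE
#define CACHELINE_SIZE 64
#endif
#define ALIGNED(X) __attribute__((aligned(X)))

/* One counter per thread; the attribute pads each element to a full line. */
struct padded_count {
    unsigned long val;
} ALIGNED(CACHELINE_SIZE);

/* The aligned attribute also rounds sizeof() up to the alignment. */
static_assert(sizeof(struct padded_count) % CACHELINE_SIZE == 0,
              "each element must cover whole cache lines");

static struct padded_count counts[4];

/* Nonzero if both addresses fall into the same (assumed) cache line. */
static int same_cache_line(const void *a, const void *b)
{
    return (uintptr_t)a / CACHELINE_SIZE == (uintptr_t)b / CACHELINE_SIZE;
}
```

With this layout, counts[0] and counts[1] never share a line, so two threads updating their own element do not invalidate each other's cached copy. The constant is still only a hint: if the real line is larger than CACHELINE_SIZE, the code stays correct but the padding no longer prevents sharing.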
* Re: [Qemu-devel] [PATCH v2 2/3] tests: use QEMU_CACHELINE_SIZE instead of hard-coding it 2017-06-05 22:49 ` [Qemu-devel] [PATCH v2 2/3] tests: use QEMU_CACHELINE_SIZE instead of hard-coding it Emilio G. Cota @ 2017-06-06 5:40 ` Pranith Kumar 0 siblings, 0 replies; 18+ messages in thread From: Pranith Kumar @ 2017-06-06 5:40 UTC (permalink / raw) To: Emilio G. Cota Cc: qemu-devel, Richard Henderson, Alex Bennée, Peter Maydell, Paolo Bonzini On Mon, Jun 5, 2017 at 6:49 PM, Emilio G. Cota <cota@braap.org> wrote: > Signed-off-by: Emilio G. Cota <cota@braap.org> Reviewed-by: Pranith Kumar <bobby.prani@gmail.com> > --- > tests/atomic_add-bench.c | 4 ++-- > tests/qht-bench.c | 2 +- > 2 files changed, 3 insertions(+), 3 deletions(-) > > diff --git a/tests/atomic_add-bench.c b/tests/atomic_add-bench.c > index caa1e8e..c219109 100644 > --- a/tests/atomic_add-bench.c > +++ b/tests/atomic_add-bench.c > @@ -5,11 +5,11 @@ > > struct thread_info { > uint64_t r; > -} QEMU_ALIGNED(64); > +} QEMU_ALIGNED(QEMU_CACHELINE_SIZE); > > struct count { > unsigned long val; > -} QEMU_ALIGNED(64); > +} QEMU_ALIGNED(QEMU_CACHELINE_SIZE); > > static QemuThread *threads; > static struct thread_info *th_info; > diff --git a/tests/qht-bench.c b/tests/qht-bench.c > index 2afa09d..3f4b5eb 100644 > --- a/tests/qht-bench.c > +++ b/tests/qht-bench.c > @@ -28,7 +28,7 @@ struct thread_info { > uint64_t r; > bool write_op; /* writes alternate between insertions and removals */ > bool resize_down; > -} QEMU_ALIGNED(64); /* avoid false sharing among threads */ > +} QEMU_ALIGNED(QEMU_CACHELINE_SIZE); /* avoid false sharing among threads */ > > static struct qht ht; > static QemuThread *rw_threads; > -- > 2.7.4 > -- Pranith ^ permalink raw reply [flat|nested] 18+ messages in thread
* [Qemu-devel] [PATCH v2 3/3] tcg: allocate TB structs before the corresponding translated code 2017-06-05 22:49 [Qemu-devel] [PATCH v2 0/3] tcg: allocate TB structs preceding translated code Emilio G. Cota 2017-06-05 22:49 ` [Qemu-devel] [PATCH v2 1/3] compiler: define QEMU_CACHELINE_SIZE Emilio G. Cota 2017-06-05 22:49 ` [Qemu-devel] [PATCH v2 2/3] tests: use QEMU_CACHELINE_SIZE instead of hard-coding it Emilio G. Cota @ 2017-06-05 22:49 ` Emilio G. Cota 2017-06-06 5:36 ` Pranith Kumar 2017-06-06 8:24 ` Richard Henderson 2 siblings, 2 replies; 18+ messages in thread From: Emilio G. Cota @ 2017-06-05 22:49 UTC (permalink / raw) To: qemu-devel Cc: Richard Henderson, alex.bennee, Peter Maydell, Paolo Bonzini, Pranith Kumar Allocating an arbitrarily-sized array of tbs results in either (a) a lot of memory wasted or (b) unnecessary flushes of the code cache when we run out of TB structs in the array. An obvious solution would be to just malloc a TB struct when needed, and keep the TB array as an array of pointers (recall that tb_find_pc() needs the TB array to run in O(log n)). Perhaps a better solution, which is implemented in this patch, is to allocate TB's right before the translated code they describe. This results in some memory waste due to padding to have code and TBs in separate cache lines--for instance, I measured 4.7% of padding in the used portion of code_gen_buffer when booting aarch64 Linux on a host with 64-byte cache lines. However, it can allow for optimizations in some host architectures, since TCG backends could safely assume that the TB and the corresponding translated code are very close to each other in memory. See this message by rth for a detailed explanation: https://lists.gnu.org/archive/html/qemu-devel/2017-03/msg05172.html Subject: Re: GSoC 2017 Proposal: TCG performance enhancements Message-ID: <1e67644b-4b30-887e-d329-1848e94c9484@twiddle.net> Suggested-by: Richard Henderson <rth@twiddle.net> Signed-off-by: Emilio G. 
Cota <cota@braap.org> --- include/exec/exec-all.h | 2 +- include/exec/tb-context.h | 3 ++- tcg/tcg.c | 16 ++++++++++++++++ tcg/tcg.h | 2 +- translate-all.c | 37 ++++++++++++++++++++++--------------- 5 files changed, 42 insertions(+), 18 deletions(-) diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h index 87ae10b..00c0f43 100644 --- a/include/exec/exec-all.h +++ b/include/exec/exec-all.h @@ -363,7 +363,7 @@ struct TranslationBlock { */ uintptr_t jmp_list_next[2]; uintptr_t jmp_list_first; -}; +} QEMU_ALIGNED(QEMU_CACHELINE_SIZE); void tb_free(TranslationBlock *tb); void tb_flush(CPUState *cpu); diff --git a/include/exec/tb-context.h b/include/exec/tb-context.h index c7f17f2..25c2afe 100644 --- a/include/exec/tb-context.h +++ b/include/exec/tb-context.h @@ -31,8 +31,9 @@ typedef struct TBContext TBContext; struct TBContext { - TranslationBlock *tbs; + TranslationBlock **tbs; struct qht htable; + size_t tbs_size; int nb_tbs; /* any access to the tbs or the page table must use this lock */ QemuMutex tb_lock; diff --git a/tcg/tcg.c b/tcg/tcg.c index 564292f..f657c51 100644 --- a/tcg/tcg.c +++ b/tcg/tcg.c @@ -383,6 +383,22 @@ void tcg_context_init(TCGContext *s) } } +/* + * Allocate TBs right before their corresponding translated code, making + * sure that TBs and code are on different cache lines. + */ +TranslationBlock *tcg_tb_alloc(TCGContext *s) +{ + void *aligned; + + aligned = (void *)ROUND_UP((uintptr_t)s->code_gen_ptr, QEMU_CACHELINE_SIZE); + if (unlikely(aligned + sizeof(TranslationBlock) > s->code_gen_highwater)) { + return NULL; + } + s->code_gen_ptr += aligned - s->code_gen_ptr + sizeof(TranslationBlock); + return aligned; +} + void tcg_prologue_init(TCGContext *s) { size_t prologue_size, total_size; diff --git a/tcg/tcg.h b/tcg/tcg.h index 5ec48d1..9e37722 100644 --- a/tcg/tcg.h +++ b/tcg/tcg.h @@ -697,7 +697,6 @@ struct TCGContext { here, because there's too much arithmetic throughout that relies on addition and subtraction working on bytes. 
Rely on the GCC extension that allows arithmetic on void*. */ - int code_gen_max_blocks; void *code_gen_prologue; void *code_gen_epilogue; void *code_gen_buffer; @@ -756,6 +755,7 @@ static inline bool tcg_op_buf_full(void) /* tb_lock must be held for tcg_malloc_internal. */ void *tcg_malloc_internal(TCGContext *s, int size); void tcg_pool_reset(TCGContext *s); +TranslationBlock *tcg_tb_alloc(TCGContext *s); void tb_lock(void); void tb_unlock(void); diff --git a/translate-all.c b/translate-all.c index b3ee876..0eb9d13 100644 --- a/translate-all.c +++ b/translate-all.c @@ -781,12 +781,13 @@ static inline void code_gen_alloc(size_t tb_size) exit(1); } - /* Estimate a good size for the number of TBs we can support. We - still haven't deducted the prologue from the buffer size here, - but that's minimal and won't affect the estimate much. */ - tcg_ctx.code_gen_max_blocks - = tcg_ctx.code_gen_buffer_size / CODE_GEN_AVG_BLOCK_SIZE; - tcg_ctx.tb_ctx.tbs = g_new(TranslationBlock, tcg_ctx.code_gen_max_blocks); + /* size this conservatively -- realloc later if needed */ + tcg_ctx.tb_ctx.tbs_size = + tcg_ctx.code_gen_buffer_size / CODE_GEN_AVG_BLOCK_SIZE / 8; + if (unlikely(!tcg_ctx.tb_ctx.tbs_size)) { + tcg_ctx.tb_ctx.tbs_size = 64 * 1024; + } + tcg_ctx.tb_ctx.tbs = g_new(TranslationBlock *, tcg_ctx.tb_ctx.tbs_size); qemu_mutex_init(&tcg_ctx.tb_ctx.tb_lock); } @@ -828,13 +829,20 @@ bool tcg_enabled(void) static TranslationBlock *tb_alloc(target_ulong pc) { TranslationBlock *tb; + TBContext *ctx; assert_tb_locked(); - if (tcg_ctx.tb_ctx.nb_tbs >= tcg_ctx.code_gen_max_blocks) { + tb = tcg_tb_alloc(&tcg_ctx); + if (unlikely(tb == NULL)) { return NULL; } - tb = &tcg_ctx.tb_ctx.tbs[tcg_ctx.tb_ctx.nb_tbs++]; + ctx = &tcg_ctx.tb_ctx; + if (unlikely(ctx->nb_tbs == ctx->tbs_size)) { + ctx->tbs_size *= 2; + ctx->tbs = g_renew(TranslationBlock *, ctx->tbs, ctx->tbs_size); + } + ctx->tbs[ctx->nb_tbs++] = tb; tb->pc = pc; tb->cflags = 0; tb->invalid = false; @@ -850,8 +858,8 @@ void 
tb_free(TranslationBlock *tb) Ignore the hard cases and just back up if this TB happens to be the last one generated. */ if (tcg_ctx.tb_ctx.nb_tbs > 0 && - tb == &tcg_ctx.tb_ctx.tbs[tcg_ctx.tb_ctx.nb_tbs - 1]) { - tcg_ctx.code_gen_ptr = tb->tc_ptr; + tb == tcg_ctx.tb_ctx.tbs[tcg_ctx.tb_ctx.nb_tbs - 1]) { + tcg_ctx.code_gen_ptr = tb->tc_ptr - sizeof(TranslationBlock); tcg_ctx.tb_ctx.nb_tbs--; } } @@ -1666,7 +1674,7 @@ static TranslationBlock *tb_find_pc(uintptr_t tc_ptr) m_max = tcg_ctx.tb_ctx.nb_tbs - 1; while (m_min <= m_max) { m = (m_min + m_max) >> 1; - tb = &tcg_ctx.tb_ctx.tbs[m]; + tb = tcg_ctx.tb_ctx.tbs[m]; v = (uintptr_t)tb->tc_ptr; if (v == tc_ptr) { return tb; @@ -1676,7 +1684,7 @@ static TranslationBlock *tb_find_pc(uintptr_t tc_ptr) m_min = m + 1; } } - return &tcg_ctx.tb_ctx.tbs[m_max]; + return tcg_ctx.tb_ctx.tbs[m_max]; } #if !defined(CONFIG_USER_ONLY) @@ -1874,7 +1882,7 @@ void dump_exec_info(FILE *f, fprintf_function cpu_fprintf) direct_jmp_count = 0; direct_jmp2_count = 0; for (i = 0; i < tcg_ctx.tb_ctx.nb_tbs; i++) { - tb = &tcg_ctx.tb_ctx.tbs[i]; + tb = tcg_ctx.tb_ctx.tbs[i]; target_code_size += tb->size; if (tb->size > max_target_code_size) { max_target_code_size = tb->size; @@ -1894,8 +1902,7 @@ void dump_exec_info(FILE *f, fprintf_function cpu_fprintf) cpu_fprintf(f, "gen code size %td/%zd\n", tcg_ctx.code_gen_ptr - tcg_ctx.code_gen_buffer, tcg_ctx.code_gen_highwater - tcg_ctx.code_gen_buffer); - cpu_fprintf(f, "TB count %d/%d\n", - tcg_ctx.tb_ctx.nb_tbs, tcg_ctx.code_gen_max_blocks); + cpu_fprintf(f, "TB count %d\n", tcg_ctx.tb_ctx.nb_tbs); cpu_fprintf(f, "TB avg target size %d max=%d bytes\n", tcg_ctx.tb_ctx.nb_tbs ? target_code_size / tcg_ctx.tb_ctx.nb_tbs : 0, -- 2.7.4 ^ permalink raw reply related [flat|nested] 18+ messages in thread
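The allocation scheme in tcg_tb_alloc() above can be illustrated in isolation. The following is a simplified sketch, not the patch's code: struct tb_hdr, struct code_buf and tb_hdr_alloc() are hypothetical names, LINE is an assumed 64-byte line size, and the bounds check is written in terms of remaining bytes instead of void * arithmetic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LINE 64  /* assumed cache line size for this sketch */
#define ROUND_UP(n, d) (((n) + (d) - 1) & ~((uintptr_t)(d) - 1))

static_assert((LINE & (LINE - 1)) == 0, "ROUND_UP needs a power of two");

/* Stand-in for TranslationBlock; the attribute pads it to whole lines. */
struct tb_hdr {
    uintptr_t pc;
    void *tc_ptr;
} __attribute__((aligned(LINE)));

/* Stand-in for the relevant TCGContext fields. */
struct code_buf {
    uint8_t *ptr;        /* next free byte (code_gen_ptr) */
    uint8_t *highwater;  /* end of usable space (code_gen_highwater) */
};

/*
 * Carve a header out of the code buffer at the next line boundary.
 * The code for the block is then emitted at b->ptr, directly after it.
 */
static struct tb_hdr *tb_hdr_alloc(struct code_buf *b)
{
    uintptr_t start = (uintptr_t)b->ptr;
    size_t pad = ROUND_UP(start, LINE) - start;
    size_t left = (size_t)(b->highwater - b->ptr);

    if (pad + sizeof(struct tb_hdr) > left) {
        return NULL;  /* buffer full: the caller would flush and retry */
    }
    b->ptr += pad + sizeof(struct tb_hdr);
    return (struct tb_hdr *)(b->ptr - sizeof(struct tb_hdr));
}
```

Since sizeof(struct tb_hdr) is a multiple of LINE, the code that follows a header starts on a fresh line, which is how headers and translated code end up on different cache lines.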
* Re: [Qemu-devel] [PATCH v2 3/3] tcg: allocate TB structs before the corresponding translated code
From: Pranith Kumar @ 2017-06-06 5:36 UTC
To: Emilio G. Cota
Cc: qemu-devel, Richard Henderson, Alex Bennée, Peter Maydell, Paolo Bonzini

On Mon, Jun 5, 2017 at 6:49 PM, Emilio G. Cota <cota@braap.org> wrote:
> Allocating an arbitrarily-sized array of tbs results in either
> (a) a lot of memory wasted or (b) unnecessary flushes of the code
> cache when we run out of TB structs in the array.
>
> An obvious solution would be to just malloc a TB struct when needed,
> and keep the TB array as an array of pointers (recall that tb_find_pc()
> needs the TB array to run in O(log n)).
>
> Perhaps a better solution, which is implemented in this patch, is to
> allocate TB's right before the translated code they describe. This
> results in some memory waste due to padding to have code and TBs in
> separate cache lines--for instance, I measured 4.7% of padding in the
> used portion of code_gen_buffer when booting aarch64 Linux on a
> host with 64-byte cache lines. However, it can allow for optimizations
> in some host architectures, since TCG backends could safely assume that
> the TB and the corresponding translated code are very close to each
> other in memory. See this message by rth for a detailed explanation:
>
> https://lists.gnu.org/archive/html/qemu-devel/2017-03/msg05172.html
>       Subject: Re: GSoC 2017 Proposal: TCG performance enhancements
>       Message-ID: <1e67644b-4b30-887e-d329-1848e94c9484@twiddle.net>

Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>

Thanks for doing this Emilio. Do you plan to continue working on rth's
suggestions in that email? If so, can we co-ordinate our work?

--
Pranith

* Re: [Qemu-devel] [PATCH v2 3/3] tcg: allocate TB structs before the corresponding translated code
  2017-06-06  5:36   ` Pranith Kumar
@ 2017-06-06 17:13     ` Emilio G. Cota
  0 siblings, 0 replies; 18+ messages in thread
From: Emilio G. Cota @ 2017-06-06 17:13 UTC
To: Pranith Kumar
Cc: qemu-devel, Richard Henderson, Alex Bennée, Peter Maydell, Paolo Bonzini

On Tue, Jun 06, 2017 at 01:36:50 -0400, Pranith Kumar wrote:
> Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>
>
> Thanks for doing this Emilio. Do you plan to continue working on rth's
> suggestions in that email? If so, can we co-ordinate our work?

My plan is to work on instrumentation. This was just low-hanging fruit;
I was curious to see the impact on cache miss rates of bringing the TB's
close to the corresponding translated code. Turns out it's pretty small
or my L1's are too big :-) The memory savings are significant though,
with the added benefit that this can enable more efficient translated
code as Richard pointed out.

I've just left a message on the GSoC thread with ideas.

		Emilio
* Re: [Qemu-devel] [PATCH v2 3/3] tcg: allocate TB structs before the corresponding translated code
  2017-06-05 22:49 ` [Qemu-devel] [PATCH v2 3/3] tcg: allocate TB structs before the corresponding translated code Emilio G. Cota
  2017-06-06  5:36   ` Pranith Kumar
@ 2017-06-06  8:24   ` Richard Henderson
  2017-06-06 16:25     ` Emilio G. Cota
  1 sibling, 1 reply; 18+ messages in thread
From: Richard Henderson @ 2017-06-06 8:24 UTC
To: Emilio G. Cota, qemu-devel
Cc: alex.bennee, Peter Maydell, Paolo Bonzini, Pranith Kumar

On 06/05/2017 03:49 PM, Emilio G. Cota wrote:
> +TranslationBlock *tcg_tb_alloc(TCGContext *s)
> +{
> +    void *aligned;
> +
> +    aligned = (void *)ROUND_UP((uintptr_t)s->code_gen_ptr, QEMU_CACHELINE_SIZE);
> +    if (unlikely(aligned + sizeof(TranslationBlock) > s->code_gen_highwater)) {
> +        return NULL;
> +    }
> +    s->code_gen_ptr += aligned - s->code_gen_ptr + sizeof(TranslationBlock);
> +    return aligned;

We don't really need the 2/3 patch. We don't gain anything by telling the
compiler that the structure is more aligned than it needs to be.

We can query the line size at runtime, as suggested by Pranith, and use that
for the alignment here. Which means that the binary isn't tied to a
particular cpu implementation, which is clearly preferable for
distributions.

r~
* Re: [Qemu-devel] [PATCH v2 3/3] tcg: allocate TB structs before the corresponding translated code
  2017-06-06  8:24   ` Richard Henderson
@ 2017-06-06 16:25     ` Emilio G. Cota
  2017-06-06 17:02       ` Richard Henderson
  0 siblings, 1 reply; 18+ messages in thread
From: Emilio G. Cota @ 2017-06-06 16:25 UTC
To: Richard Henderson
Cc: qemu-devel, alex.bennee, Peter Maydell, Paolo Bonzini, Pranith Kumar

On Tue, Jun 06, 2017 at 01:24:11 -0700, Richard Henderson wrote:
> On 06/05/2017 03:49 PM, Emilio G. Cota wrote:
> >+TranslationBlock *tcg_tb_alloc(TCGContext *s)
> >+{
> >+    void *aligned;
> >+
> >+    aligned = (void *)ROUND_UP((uintptr_t)s->code_gen_ptr, QEMU_CACHELINE_SIZE);
> >+    if (unlikely(aligned + sizeof(TranslationBlock) > s->code_gen_highwater)) {
> >+        return NULL;
> >+    }
> >+    s->code_gen_ptr += aligned - s->code_gen_ptr + sizeof(TranslationBlock);
> >+    return aligned;
>
> We don't really need the 2/3 patch. We don't gain anything by telling the
> compiler that the structure is more aligned than it needs to be.

The compile-time requirement is for the compiler to pad the structs
appropriately; this is critical to avoid false sharing when allocating
arrays of structs like those test programs do.

> We can query the line size at runtime, as suggested by Pranith, and use that
> for the alignment here. Which means that the binary isn't tied to a
> particular cpu implementation, which is clearly preferable for
> distributions.

For this particular case we can get away without padding the structs if
we're OK with having the end of a TB struct immediately followed
by its translated code, instead of having that code on the following
cache line.

		E.
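The compile-time padding argument above can be illustrated with a minimal sketch. It is not code from the series: CACHELINE_SIZE is a hypothetical stand-in for QEMU_CACHELINE_SIZE (fixed at 64 here), the struct names and same_line() are made up for illustration, and the aligned attribute is the GCC/Clang spelling.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: stand-in for QEMU_CACHELINE_SIZE from patch 1/3. */
#define CACHELINE_SIZE 64

/* Unpadded: several adjacent array elements fit in one cache line. */
struct count_plain {
    unsigned long val;
};

/*
 * Padded at compile time: the aligned attribute (GCC/Clang) raises the
 * alignment, and sizeof is rounded up to a multiple of it, so each array
 * element occupies its own cache line.
 */
struct count_padded {
    unsigned long val;
} __attribute__((aligned(CACHELINE_SIZE)));

/*
 * Hypothetical helper: nonzero if elements i and j of an array whose base
 * is cache-line aligned start on the same cache line.
 */
static int same_line(size_t elem_size, size_t i, size_t j)
{
    assert(elem_size > 0);
    return (i * elem_size) / CACHELINE_SIZE ==
           (j * elem_size) / CACHELINE_SIZE;
}
```

With the plain struct, elements 0 and 1 of an array land on the same line, so two threads updating them would contend; with the padded struct they do not. This only works when the line size is known to the compiler, which is the point being made about arrays of structs in the test programs.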
* Re: [Qemu-devel] [PATCH v2 3/3] tcg: allocate TB structs before the corresponding translated code
  2017-06-06 16:25     ` Emilio G. Cota
@ 2017-06-06 17:02       ` Richard Henderson
  2017-06-06 17:31         ` Emilio G. Cota
  0 siblings, 1 reply; 18+ messages in thread
From: Richard Henderson @ 2017-06-06 17:02 UTC
To: Emilio G. Cota
Cc: qemu-devel, alex.bennee, Peter Maydell, Paolo Bonzini, Pranith Kumar

On 06/06/2017 09:25 AM, Emilio G. Cota wrote:
> For this particular case we can get away without padding the structs if
> we're OK with having the end of a TB struct immediately followed
> by its translated code, instead of having that code on the following
> cache line.

Uh, no, if you can manually pad before the struct, you can manually pad
after the struct too.

r~
* Re: [Qemu-devel] [PATCH v2 3/3] tcg: allocate TB structs before the corresponding translated code
  2017-06-06 17:02       ` Richard Henderson
@ 2017-06-06 17:31         ` Emilio G. Cota
  0 siblings, 0 replies; 18+ messages in thread
From: Emilio G. Cota @ 2017-06-06 17:31 UTC
To: Richard Henderson
Cc: qemu-devel, alex.bennee, Peter Maydell, Paolo Bonzini, Pranith Kumar

On Tue, Jun 06, 2017 at 10:02:17 -0700, Richard Henderson wrote:
> On 06/06/2017 09:25 AM, Emilio G. Cota wrote:
> >For this particular case we can get away without padding the structs if
> >we're OK with having the end of a TB struct immediately followed
> >by its translated code, instead of having that code on the following
> >cache line.
>
> Uh, no, if you can manually pad before the struct, you can manually pad
> after the struct too.

Yes of course =) I'll respin the series to do this.

		E.
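The outcome of this exchange (align with a line size known only at runtime, and pad both before and after the struct) might look roughly like the sketch below. It is not the respun patch: struct tb, struct codebuf, round_up() and tb_alloc() are hypothetical stand-ins for TranslationBlock, TCGContext, ROUND_UP and tcg_tb_alloc(), and the line field is assumed to have been filled in at startup by some runtime query (on Linux with glibc, sysconf(_SC_LEVEL1_DCACHE_LINESIZE) is one candidate, with a fallback if it reports 0).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for TranslationBlock. */
struct tb {
    uintptr_t pc;
    void *tc_ptr;
};

/* Hypothetical stand-in for the relevant TCGContext fields. */
struct codebuf {
    uint8_t *ptr;        /* next free byte (code_gen_ptr in the patch) */
    uint8_t *highwater;  /* allocation limit (code_gen_highwater) */
    size_t line;         /* cache line size found at runtime, power of two */
};

/* Round x up to a multiple of align, which must be a power of two. */
static uintptr_t round_up(uintptr_t x, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (x + align - 1) & ~((uintptr_t)align - 1);
}

/*
 * Carve a TB descriptor out of the code buffer. The descriptor starts on
 * a cache line boundary (padding before) and the buffer pointer is left
 * on the next boundary (padding after), so the translated code that
 * follows never shares a line with the descriptor.
 */
static struct tb *tb_alloc(struct codebuf *s)
{
    uintptr_t tb = round_up((uintptr_t)s->ptr, s->line);
    uintptr_t next = round_up(tb + sizeof(struct tb), s->line);

    if (next > (uintptr_t)s->highwater) {
        return NULL;  /* buffer full: the caller would flush and retry */
    }
    s->ptr = (uint8_t *)next;
    return (struct tb *)tb;
}
```

Because both roundings are done by hand, the struct itself needs no alignment attribute, and the same binary adapts to hosts with 64-byte or 128-byte lines.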