On Thu, Mar 22, 2012 at 06:58:29AM -0700, Dan Smith wrote:
> AA> The only reasonable explanation I can imagine for the weird stuff
> AA> going on with "numa01_inverse" is that maybe it was compiled without
> AA> -DHARD_BIND? I forgot to specify that -DINVERSE_BIND is a noop unless
> AA> -DHARD_BIND is specified too at the same time. -DINVERSE_BIND alone
> AA> results in the default build without -D parameters.
>
> Ah, yeah, that's probably it. Later I'll try re-running some of the
> cases to verify.

Ok! If you re-run on autonuma, please also check out the autonuma branch
again (or autonuma-dev), because I had a bug where autonuma would
optimize the hard binds too :). Now they're fully obeyed (I was already
obeying vma_policy(vma), but I forgot to also check that
current->mempolicy is null).

BTW, in the meantime I've run some virt benchmarks.. attached is a vnc
screenshot: the first run is with autonuma off, the second and third
runs are with autonuma on. full_scan is incremented every 10 sec with
autonuma on, so the scanning overhead is being measured. With autonuma
off, the wrong node is picked roughly 50% of the time, and you can see
the difference in elapsed time when that happens. AutoNUMA gets it right
100% of the time thanks to autonuma_balance (always "16 sec" instead of
"16 sec or 26 sec" is a great improvement).

I also tried to measure a kernel build in a VM that fits in one node (in
both CPUs and RAM), but I get badly bitten by HT effects: I should
basically teach autonuma_balance that it's better to spread the load to
remote nodes if the remote nodes have a full core idle while the local
node has only an HT sibling idle. What a mess. Anyway, the current code
performs optimally if all nodes are busy and there are no idle cores (or
only idle siblings). I guess I'll leave the HT optimizations for later;
I should probably measure this again with HT off.

Running the kernel build in a VM that spans the whole system won't be
meaningful until I run autonuma in the guest too, but for that I need a
virtual topology and I haven't yet looked into how to tell qemu to
provide one. Without autonuma in the guest, the guest scheduler will
freely move guest gcc tasks from vCPU0 to vCPU1, and those two vCPUs may
be on threads that live in different nodes on the host, so the guest
scheduler, unaware of the host topology, can trigger spurious memory
migrations.
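
To illustrate the -DHARD_BIND/-DINVERSE_BIND interaction quoted at the
top, a minimal sketch in the spirit of the benchmark (this is not the
actual numa01 source; the use of numa_run_on_node() from libnuma and the
node parameters are assumptions for illustration only):

#include <numa.h>	/* libnuma, assumed here for the hard bind */

/*
 * Hedged sketch: INVERSE_BIND only matters inside the HARD_BIND branch,
 * so defining -DINVERSE_BIND alone falls through to the default,
 * unbound build.
 */
static void setup_binding(int local_node, int inverse_node)
{
	(void)local_node;
	(void)inverse_node;
#ifdef HARD_BIND
# ifdef INVERSE_BIND
	numa_run_on_node(inverse_node);	/* hard bind to the "wrong" node */
# else
	numa_run_on_node(local_node);	/* hard bind to the local node */
# endif
#else
	/* without -DHARD_BIND neither branch above is compiled in, so
	 * -DINVERSE_BIND alone gives the default build */
#endif
}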
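
The missing hard-bind check could look roughly like this; a hedged
sketch, not the actual autonuma patch (knuma_scan_skip_vma() is a
hypothetical name, only vma_policy() and current->mempolicy come from
the description above):

#include <linux/sched.h>
#include <linux/mm_types.h>
#include <linux/mempolicy.h>

/*
 * Hedged sketch of the hard-bind check described above: the per-VMA
 * mbind() policy was already being respected, the missing piece was the
 * task-wide policy set with set_mempolicy().
 */
static bool knuma_scan_skip_vma(struct task_struct *p,
				struct vm_area_struct *vma)
{
	if (vma_policy(vma))	/* mbind() on this range: leave it alone */
		return true;
	if (p->mempolicy)	/* task-wide policy: leave it alone too */
		return true;
	return false;
}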
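
The HT heuristic described above, sketched with hypothetical helpers
(node_has_idle_core() and node_has_idle_sibling() don't exist in the
tree; only the ordering of the checks matters):

/*
 * Hedged sketch of the HT-aware placement decision, not code from the
 * autonuma branch: a fully idle core on a remote node should win over a
 * merely idle HT sibling on the local node.
 */
static bool prefer_remote_node(int local_nid, int remote_nid)
{
	if (node_has_idle_core(local_nid))
		return false;	/* a whole core is idle locally: stay local */
	if (node_has_idle_sibling(local_nid) &&
	    node_has_idle_core(remote_nid))
		return true;	/* only an HT sibling idle here, full core idle there */
	return false;		/* otherwise keep the current NUMA affinity */
}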