On Tue, Aug 22, 2017 at 2:24 PM, Andi Kleen wrote:
>
> I believe in this case it's used by threads, so a reference count limit
> wouldn't help.

For the first migration try, yes. But if it's some kind of "try and try
again" pattern, the second time you try and there are people waiting for
the page, the page count (not the map count) would be elevated.

So it's possible that, depending on exactly what the deeper problem is,
the "this page is very busy, don't migrate" case might be discoverable,
and the page count might be part of it (a sketch of what such a check
could look like is at the very end of this mail).

However, after PeterZ made that comment that page migration should have
that should_numa_migrate_memory() filter, I am looking at that
mpol_misplaced() code.

And honestly, that MPOL_PREFERRED / MPOL_F_LOCAL case really looks like
complete garbage to me.

It looks like garbage exactly because it says "always migrate to the
current node", but that's crazy - if it's a group of threads all running
together on the same VM, that obviously will just bounce the page around
for absolutely zero good reason.

The *other* memory policies look fairly sane. They basically have a
fairly well-defined preferred node for the policy (although
MPOL_INTERLEAVE looks wrong for a hugepage). But
MPOL_PREFERRED/MPOL_F_LOCAL really looks completely broken.

Maybe people expected that anybody who uses MPOL_F_LOCAL will also bind
all threads to one single node?

Could we perhaps make that "MPOL_PREFERRED / MPOL_F_LOCAL" case just do
the MPOL_F_MORON policy, which *does* use that "should I migrate to the
local node" filter?

IOW, we've been looking at the waiters (because the problem shows up due
to the excessive wait queues), but maybe the source of the problem comes
from the numa balancing code just insanely bouncing pages back and forth
if you use that "always balance to local node" thing.

Untested (as always) patch attached.

                  Linus
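
For reference, the two relevant pieces of mpol_misplaced() in
mm/mempolicy.c (quoting from roughly the v4.13 tree; other versions may
differ slightly) are the MPOL_PREFERRED case and the MPOL_F_MORON
filter further down:

	case MPOL_PREFERRED:
		if (pol->flags & MPOL_F_LOCAL)
			polnid = numa_node_id();
		else
			polnid = pol->v.preferred_node;
		break;

	...

	/* Migrate the page towards the node whose CPU is referencing it */
	if (pol->flags & MPOL_F_MORON) {
		polnid = thisnid;

		if (!should_numa_migrate_memory(current, page, curnid, thiscpu))
			goto out;
	}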
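
The attached patch itself is not reproduced here, but a rough sketch of
the idea is below. This is an illustration, not the actual attachment;
it assumes the layout quoted above, where thisnid, thiscpu and curnid
have already been computed earlier in mpol_misplaced():

	case MPOL_PREFERRED:
		if (pol->flags & MPOL_F_LOCAL) {
			/*
			 * Don't blindly bounce the page to the faulting
			 * node: run "local" through the same filter that
			 * the MPOL_F_MORON (NUMA fault) path uses.
			 */
			polnid = thisnid;
			if (!should_numa_migrate_memory(current, page,
							curnid, thiscpu))
				goto out;
			break;
		}
		polnid = pol->v.preferred_node;
		break;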
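
Finally, on the "this page is very busy, don't migrate" idea from the
top of this mail: a purely hypothetical helper for that (the name
page_looks_busy() is made up, and the expected-count arithmetic only
approximates what migrate_page_move_mapping() actually verifies) might
look something like:

	/*
	 * Hypothetical, illustration only: guess whether a page has
	 * transient extra references (waiters, lockers, concurrent
	 * get_page() users) beyond what its mappings account for.
	 */
	static bool page_looks_busy(struct page *page)
	{
		/*
		 * References we can explain: one per pte mapping, one
		 * if the page sits in the page or swap cache, plus the
		 * reference we are holding ourselves.
		 */
		int expected = page_mapcount(page) +
			       !!(page_mapping(page) || PageSwapCache(page)) + 1;

		return page_count(page) > expected;
	}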