* page_waitqueue() considered harmful
@ 2016-09-26 20:58 Linus Torvalds
  2016-09-26 21:23 ` Rik van Riel
  ` (3 more replies)
  0 siblings, 4 replies; 42+ messages in thread
From: Linus Torvalds @ 2016-09-26 20:58 UTC (permalink / raw)
To: Andrew Morton, Kirill A. Shutemov, Johannes Weiner, Jan Kara, Mel Gorman, Peter Zijlstra, Rik van Riel
Cc: linux-mm

So I've been doing some profiling of the git "make test" load, which is interesting because it's just a lot of small scripts, and it shows our fork/execve/exit costs.

Some of the top offenders are pretty understandable: we have long had unmap_page_range() show up at the top on these loads, because the zap_pte_range() function ends up touching each page as it unmaps things (it does it to check whether it's an anonymous page, but then also for the page map count update etc), and that's a disaster from a cache standpoint. That single function is something between 3-4% of CPU time, and the one instruction that accesses "struct page" for the first time accounts for a large portion of that. Yes, a single instruction is blamed for about 1% of all CPU time on a fork/exec/exit workload.

Anyway, there really isn't a ton to be done about it. The same goes for the reverse side of the coin: filemap_map_pages() (to map in the new executable pages) is #2, and copy_page() (COW after fork()) is #3. Those are all kind of inevitable for the load.

#4 on my list is "native_irq_return_iret", which is just a sign of serializing instructions being really expensive, and this being a workload with a lot of exceptions (most of which are the page faults). So I guess there are *two* instructions in the kernel that are really really hot. Maybe Intel will fix the cost of "iret" some day, at least partially, but that day has not yet come to pass.

Anyway, things get kind of interesting once you get past the very top offenders, and the profile starts to be less about "yeah, tough, can't fix that" and instead hits things that make you go "ehh, really?"
#5 and #6 on my profile are user space (_int_malloc in glibc, and do_lookup_x in the loader). I think user space should probably start thinking more about doing static libraries for the really core basic things, but whatever. Not a kernel issue.

#7 is in the kernel again. And that one struck me as really odd. It's "unlock_page()", while #9 is __wake_up_bit(). WTF? There's no IO in this load, it's all cached, so why do we use 3% of the time (1.7% and 1.4% respectively) on unlocking a page? And why can't I see the locking part?

It turns out that I *can* see the locking part, but it's pretty cheap. It's inside of filemap_map_pages(), which does a trylock, and it shows up as about 1/6th of the cost of that function. Still, it's much less than the unlocking side. Why is unlocking so expensive?

Yeah, unlocking is expensive because of the nasty __wake_up_bit() code. In fact, even inside "unlock_page()" itself, most of the cost isn't the atomic bit clearing (like you'd expect), it's the inlined part of wake_up_bit(). Which does some really nasty crud.

Why is the page_waitqueue() handling so expensive? Let me count the ways:

 (a) stupid code generation interacting really badly with the microarchitecture

We clear the bit in page->flags with a byte-sized "lock andb", and then page_waitqueue() looks up the zone by reading the full value of "page->flags" immediately afterwards, which causes bad bad behavior with the whole read-after-partial-write thing. Nasty.

 (b) It's cache miss heaven.

It takes a cache miss on three different things: looking up the zone 'wait_table', then looking up the hash queue there, and finally (inside __wake_up_bit) looking up the wait queue itself (which will effectively always be NULL).
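The three dependent loads in (b) can be pictured with a toy model (this is a simplified stand-in, not the kernel's actual code; the struct and field names only loosely echo the 2016 implementation):

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of the lookup chain the profile complains about:
 * page->flags -> zone -> zone->wait_table[hash] -> wait queue head.
 * Each arrow is a dependent load and a likely cache miss, and the
 * bucket at the end is almost always empty. */
struct wait_queue_head { void *task_list; };

struct zone_model {
    struct wait_queue_head *wait_table;        /* hashed wait queues   */
    unsigned long wait_table_hash_nr_entries;  /* table size           */
};

static struct wait_queue_head *
page_waitqueue_model(struct zone_model *zone, unsigned long pfn)
{
    /* miss #1 already happened: reading page->flags to find the zone */
    unsigned long bucket = pfn % zone->wait_table_hash_nr_entries; /* miss #2 */
    return &zone->wait_table[bucket];                              /* miss #3
        happens when the caller inspects the (usually NULL) queue   */
}
```

The point of the model: even when nobody waits on the page, every unlock_page() walks all three loads just to discover the empty bucket.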
Now, (a) could be fairly easy to fix in several ways (the byte-sized operation on an "unsigned long" field has caused problems before, and may just be a mistake), but (a) wouldn't even be a problem if we didn't have the complexity of (b) and having to look up the zone and everything. So realistically, it's actually (b) that is the primary problem, and it indirectly causes (a) to happen too.

Is there really any reason for that incredible indirection? Do we really want to make the page_waitqueue() be a per-zone thing at all? Especially since all those wait-queues won't even be *used* unless there is actual IO going on and people are really getting into contention on the page lock.. Why isn't the page_waitqueue() just one statically sized array?

Also, if those bitlock ops had a different bit that showed contention, we could actually skip *all* of this, and just see that "oh, nobody is waiting on this page anyway, so there's no point in looking up those wait queues". We don't have that many "__wait_on_bit()" users, maybe we could say that the bitlocks do have to have *two* bits: one for the lock bit itself, and one for "there is contention".

Looking at the history of this code, most of it actually predates the git history; it goes back to ~2004. Time to revisit that thing?

Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

^ permalink raw reply [flat|nested] 42+ messages in thread
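The two-bit scheme floated above can be sketched in userspace C11 atomics (a minimal sketch of the idea only; bit names and layout are illustrative, not the kernel's):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* One bit for the lock itself, one for "there is contention". */
#define BIT_LOCKED  (1u << 0)
#define BIT_WAITERS (1u << 1)

/* Waiter slow path: flag contention before going to sleep on the bit,
 * so the unlocker knows a wakeup is actually needed. */
static void mark_waiter(atomic_uint *flags)
{
    atomic_fetch_or(flags, BIT_WAITERS);
}

/* Unlock fast path: clear the lock bit, and only report that the
 * (expensive) hashed wait queue must be walked when some waiter
 * flagged itself first.  The common uncontended case never touches
 * the wait queues at all. */
static bool unlock_needs_wakeup(atomic_uint *flags)
{
    unsigned int old = atomic_fetch_and(flags, ~BIT_LOCKED);
    return (old & BIT_WAITERS) != 0;
}
```

In this sketch the waiters bit is left set on unlock; a real implementation would need a protocol for clearing it once the queue drains, which is exactly the subtle part the rest of the thread worries about.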
* Re: page_waitqueue() considered harmful
From: Rik van Riel @ 2016-09-26 21:23 UTC (permalink / raw)
To: Linus Torvalds, Andrew Morton, Kirill A. Shutemov, Johannes Weiner, Jan Kara, Mel Gorman, Peter Zijlstra
Cc: linux-mm

On Mon, 2016-09-26 at 13:58 -0700, Linus Torvalds wrote:
> Is there really any reason for that incredible indirection? Do we
> really want to make the page_waitqueue() be a per-zone thing at all?
> Especially since all those wait-queues won't even be *used* unless
> there is actual IO going on and people are really getting into
> contention on the page lock.. Why isn't the page_waitqueue() just one
> statically sized array?

Why are we touching file pages at all during fork()?

Could we get away with skipping copy_page_range on VMAs that do not have any anonymous pages? Could we teach copy_page_range to skip file PTEs during fork? The child process can just fault the needed file pages in after it has been forked off.

Given how common fork + exec are, there is a real chance that:
- the child process did not need all of the parent's pages, and
- the parent process does not have some of the bits needed by the child (for exec) mapped into its page tables anyway (as suggested by the child process's page faulting on file pages)

Having fewer pages mapped might also make zap_page_range a little cheaper, both at exec and at exit time.

Am I overlooking something?

--
All Rights Reversed.

^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: page_waitqueue() considered harmful
From: Linus Torvalds @ 2016-09-26 21:30 UTC (permalink / raw)
To: Rik van Riel
Cc: Andrew Morton, Kirill A. Shutemov, Johannes Weiner, Jan Kara, Mel Gorman, Peter Zijlstra, linux-mm

On Mon, Sep 26, 2016 at 2:23 PM, Rik van Riel <riel@redhat.com> wrote:
>
> Why are we touching file pages at all during fork()?

This is *not* fork. This is fork/exec/exit. It's a real load, it's just fairly concentrated by the scripts being many and small. It's easy enough to try yourself:

  perf record -e cycles:pp make -j16 test

in the git tree.

> Could we get away with skipping copy_page_range on VMAs
> that do not have any anonymous pages?

You still have to duplicate those ranges, but that's fairly cheap.

From what I can tell, the expensive part really is the "page in new executable/library pages" and then tearing them down again (because it was just a quick small git process or a very small shell script).

So forget about fork(). I'm not even sure how much of that there is; it's possible that you end up having vfork() instead. It's exec/exit that matters most in this load. (You can see that in the profile with things like strnlen_user() actually being fairly high on the profile too - mostly the environment variables during exec.)

Linus

^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: page_waitqueue() considered harmful
From: Kirill A. Shutemov @ 2016-09-26 23:11 UTC (permalink / raw)
To: Rik van Riel
Cc: Linus Torvalds, Andrew Morton, Kirill A. Shutemov, Johannes Weiner, Jan Kara, Mel Gorman, Peter Zijlstra, linux-mm

On Mon, Sep 26, 2016 at 05:23:29PM -0400, Rik van Riel wrote:
> On Mon, 2016-09-26 at 13:58 -0700, Linus Torvalds wrote:
> > Is there really any reason for that incredible indirection? Do we
> > really want to make the page_waitqueue() be a per-zone thing at all?
> > Especially since all those wait-queues won't even be *used* unless
> > there is actual IO going on and people are really getting into
> > contention on the page lock.. Why isn't the page_waitqueue() just one
> > statically sized array?
>
> Why are we touching file pages at all during fork()?

We are not. Unless the vma has private pages (vma->anon_vma is not NULL).

See the first lines of copy_page_range().

We probably could go further and skip non-private pages within a file VMA. But we would need to touch the struct page in that case, so it doesn't make sense.

--
Kirill A. Shutemov

^ permalink raw reply [flat|nested] 42+ messages in thread
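The short-circuit Kirill points at can be restated as a toy predicate (struct and field names here are simplified stand-ins for the real `vm_area_struct` checks, not the kernel's code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct anon_vma { int unused; };

struct vma_model {
    struct anon_vma *anon_vma; /* non-NULL once private (anon) pages exist */
    bool is_special;           /* stand-in for VM_PFNMAP-style mappings    */
};

/* fork() only needs to copy page tables for a VMA that could contain
 * private pages; a plain shared file mapping can be skipped entirely,
 * and the child just faults the file pages back in on demand. */
static bool fork_must_copy_ptes(const struct vma_model *vma)
{
    return vma->anon_vma != NULL || vma->is_special;
}
```

This is why the fork() side of the workload mostly escapes the file-page cost, and why the thread refocuses on exec/exit.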
* Re: page_waitqueue() considered harmful
From: Rik van Riel @ 2016-09-27 1:01 UTC (permalink / raw)
To: Kirill A. Shutemov
Cc: Linus Torvalds, Andrew Morton, Kirill A. Shutemov, Johannes Weiner, Jan Kara, Mel Gorman, Peter Zijlstra, linux-mm

On Tue, 2016-09-27 at 02:11 +0300, Kirill A. Shutemov wrote:
> On Mon, Sep 26, 2016 at 05:23:29PM -0400, Rik van Riel wrote:
> > Why are we touching file pages at all during fork()?
>
> We are not.
> Unless the vma has private pages (vma->anon_vma is not NULL).
>
> See first lines for copy_page_range().

Ahhh, indeed. I thought I remembered an optimization like that.

--
All Rights Reversed.

^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: page_waitqueue() considered harmful
From: Peter Zijlstra @ 2016-09-27 7:30 UTC (permalink / raw)
To: Linus Torvalds
Cc: Andrew Morton, Kirill A. Shutemov, Johannes Weiner, Jan Kara, Mel Gorman, Rik van Riel, linux-mm

On Mon, Sep 26, 2016 at 01:58:00PM -0700, Linus Torvalds wrote:
> Why is the page_waitqueue() handling so expensive? Let me count the ways:
>
> (b) It's cache miss heaven. It takes a cache miss on three different
> things: looking up the zone 'wait_table', then looking up the hash
> queue there, and finally (inside __wake_up_bit) looking up the wait
> queue itself (which will effectively always be NULL).
>
> Is there really any reason for that incredible indirection? Do we
> really want to make the page_waitqueue() be a per-zone thing at all?
> Especially since all those wait-queues won't even be *used* unless
> there is actual IO going on and people are really getting into
> contention on the page lock.. Why isn't the page_waitqueue() just one
> statically sized array?

I suspect the reason is to have per node hash tables, just like we get per node page-frame arrays with sparsemem.

> Also, if those bitlock ops had a different bit that showed contention,
> we could actually skip *all* of this, and just see that "oh, nobody is
> waiting on this page anyway, so there's no point in looking up those
> wait queues". We don't have that many "__wait_on_bit()" users, maybe
> we could say that the bitlocks do have to have *two* bits: one for the
> lock bit itself, and one for "there is contention".

That would be fairly simple to implement, the difficulty would be actually getting a page-flag to use for this.
We're running pretty low in available bits :/

^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: page_waitqueue() considered harmful
From: Mel Gorman @ 2016-09-27 8:54 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Linus Torvalds, Andrew Morton, Kirill A. Shutemov, Johannes Weiner, Jan Kara, Rik van Riel, linux-mm

On Tue, Sep 27, 2016 at 09:30:55AM +0200, Peter Zijlstra wrote:
> > Also, if those bitlock ops had a different bit that showed contention,
> > we could actually skip *all* of this, and just see that "oh, nobody is
> > waiting on this page anyway, so there's no point in looking up those
> > wait queues". We don't have that many "__wait_on_bit()" users, maybe
> > we could say that the bitlocks do have to have *two* bits: one for the
> > lock bit itself, and one for "there is contention".
>
> That would be fairly simple to implement, the difficulty would be
> actually getting a page-flag to use for this. We're running pretty low
> in available bits :/

Simple is relative, unless I drastically overcomplicated things, and it wouldn't be the first time. 64-bit only side-steps the page flag issue, as long as we can live with that.

--
Mel Gorman
SUSE Labs

^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: page_waitqueue() considered harmful
From: Kirill A. Shutemov @ 2016-09-27 9:11 UTC (permalink / raw)
To: Mel Gorman
Cc: Peter Zijlstra, Linus Torvalds, Andrew Morton, Kirill A. Shutemov, Johannes Weiner, Jan Kara, Rik van Riel, linux-mm

On Tue, Sep 27, 2016 at 09:54:12AM +0100, Mel Gorman wrote:
> On Tue, Sep 27, 2016 at 09:30:55AM +0200, Peter Zijlstra wrote:
> > That would be fairly simple to implement, the difficulty would be
> > actually getting a page-flag to use for this. We're running pretty low
> > in available bits :/
>
> Simple is relative unless I drastically overcomplicated things and it
> wouldn't be the first time. 64-bit only side-steps the page flag issue
> as long as we can live with that.

Looks like we don't ever lock slab pages. Unless I'm missing something.

We can try to use PG_locked + PG_slab to indicate contention.

I tried to boot a kernel with CONFIG_SLUB + BUG_ON(PageSlab()) in the trylock/unlock_page() codepath. Works fine, but more inspection is required.

--
Kirill A. Shutemov

^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: page_waitqueue() considered harmful
From: Mel Gorman @ 2016-09-27 9:42 UTC (permalink / raw)
To: Kirill A. Shutemov
Cc: Peter Zijlstra, Linus Torvalds, Andrew Morton, Kirill A. Shutemov, Johannes Weiner, Jan Kara, Rik van Riel, linux-mm

On Tue, Sep 27, 2016 at 12:11:17PM +0300, Kirill A. Shutemov wrote:
> On Tue, Sep 27, 2016 at 09:54:12AM +0100, Mel Gorman wrote:
> > Simple is relative unless I drastically overcomplicated things and it
> > wouldn't be the first time. 64-bit only side-steps the page flag issue
> > as long as we can live with that.
>
> Looks like we don't ever lock slab pages. Unless I miss something.

I don't think we do, but direct PageSlab checks might be problematic if there were a false positive due to a locked page, and we'd have to be very careful about any races due to two bits being used. While we shouldn't rule it out, I think it's important to first look at that original patch and see if it's remotely acceptable and makes enough difference to a real workload to matter.
If so, then we could consider additional complexity on top to make it work on 32-bit -- maybe separated by one release, as it took a long time to flush out subtle bugs with the PG_waiters approach.

--
Mel Gorman
SUSE Labs

^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: page_waitqueue() considered harmful
From: Minchan Kim @ 2016-09-27 9:52 UTC (permalink / raw)
To: Kirill A. Shutemov
Cc: Mel Gorman, Peter Zijlstra, Linus Torvalds, Andrew Morton, Kirill A. Shutemov, Johannes Weiner, Jan Kara, Rik van Riel, linux-mm

On Tue, Sep 27, 2016 at 12:11:17PM +0300, Kirill A. Shutemov wrote:
> On Tue, Sep 27, 2016 at 09:54:12AM +0100, Mel Gorman wrote:
> > Simple is relative unless I drastically overcomplicated things and it
> > wouldn't be the first time. 64-bit only side-steps the page flag issue
> > as long as we can live with that.
>
> Looks like we don't ever lock slab pages. Unless I miss something.
>
> We can try to use PG_locked + PG_slab to indicate contention.
>
> I tried to boot kernel with CONFIG_SLUB + BUG_ON(PageSlab()) in
> trylock/unlock_page() codepath. Works fine, but more inspection is
> required.

SLUB uses bit_spin_lock via slab_lock instead of trylock/unlock_page.
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: page_waitqueue() considered harmful
From: Kirill A. Shutemov @ 2016-09-27 12:11 UTC (permalink / raw)
To: Minchan Kim
Cc: Mel Gorman, Peter Zijlstra, Linus Torvalds, Andrew Morton, Kirill A. Shutemov, Johannes Weiner, Jan Kara, Rik van Riel, linux-mm

On Tue, Sep 27, 2016 at 06:52:06PM +0900, Minchan Kim wrote:
> On Tue, Sep 27, 2016 at 12:11:17PM +0300, Kirill A. Shutemov wrote:
> > Looks like we don't ever lock slab pages. Unless I miss something.
> >
> > We can try to use PG_locked + PG_slab to indicate contention.
> >
> > I tried to boot kernel with CONFIG_SLUB + BUG_ON(PageSlab()) in
> > trylock/unlock_page() codepath. Works fine, but more inspection is
> > required.
>
> SLUB uses bit_spin_lock via slab_lock instead of trylock/unlock.

Ahh. Missed that.

--
Kirill A. Shutemov
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: page_waitqueue() considered harmful
From: Peter Zijlstra @ 2016-09-29 8:01 UTC (permalink / raw)
To: Mel Gorman
Cc: Linus Torvalds, Andrew Morton, Kirill A. Shutemov, Johannes Weiner, Jan Kara, Rik van Riel, linux-mm

On Tue, Sep 27, 2016 at 09:54:12AM +0100, Mel Gorman wrote:
> On Tue, Sep 27, 2016 at 09:30:55AM +0200, Peter Zijlstra wrote:
> Simple is relative unless I drastically overcomplicated things and it
> wouldn't be the first time. 64-bit only side-steps the page flag issue
> as long as we can live with that.

So one problem with the 64-bit-only page flags is that they do eat space from page-flags-layout; we do try and fit a bunch of other crap in there, and at some point it all will not fit anymore and we'll revert to worse.

I've no idea how far away from that we are for distro kernels. I suppose they have fairly large NR_NODES and NR_CPUS.

^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: page_waitqueue() considered harmful
From: Nicholas Piggin @ 2016-09-29 12:55 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Mel Gorman, Linus Torvalds, Andrew Morton, Kirill A. Shutemov, Johannes Weiner, Jan Kara, Rik van Riel, linux-mm

On Thu, 29 Sep 2016 10:01:30 +0200 Peter Zijlstra <peterz@infradead.org> wrote:
> So one problem with the 64bit only pageflags is that they do eat space
> from page-flags-layout, we do try and fit a bunch of other crap in
> there, and at some point that all will not fit anymore and we'll revert
> to worse.
>
> I've no idea how far away from that we are for distro kernels. I suppose
> they have fairly large NR_NODES and NR_CPUS.

I know it's not fashionable to care about them anymore, but it's sad if 32-bit architectures miss out on fundamental optimisations like this because we're out of page flags. It would also be sad to increase the size of struct page because we're too lazy to reduce flags. There are some that might be able to be removed.

PG_reserved - We should have killed this years ago. More users have crept back in.

PG_mappedtodisk - Rarely used, to slightly shortcut some mapping lookups. Possible for filesystems to derive this some other way, e.g., PG_private == 1 && ->private == NULL, or another filesystem private bit. We should really kill this before more users spring up and it gets stuck forever.

PG_swapcache - can this be replaced with one of the private bits, I wonder?
PG_uncached - this is PG_arch_2 really, but unfortunately x86 uses both. Still, that and PG_arch_1 could be removed from most architectures, so many of the 32-bit ones could use the extra flags.

^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: page_waitqueue() considered harmful
From: Peter Zijlstra @ 2016-09-29 13:16 UTC (permalink / raw)
To: Nicholas Piggin
Cc: Mel Gorman, Linus Torvalds, Andrew Morton, Kirill A. Shutemov, Johannes Weiner, Jan Kara, Rik van Riel, linux-mm

On Thu, Sep 29, 2016 at 10:55:44PM +1000, Nicholas Piggin wrote:
> I know it's not fashionable to care about them anymore, but it's sad if
> 32-bit architectures miss out fundamental optimisations like this because
> we're out of page flags. It would also be sad to increase the size of
> struct page because we're too lazy to reduce flags. There's some that
> might be able to be removed.

I'm all for cleaning some of that up, but it's been a long while since I poked in that general area.

^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: page_waitqueue() considered harmful
From: Nicholas Piggin @ 2016-09-29 13:54 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Mel Gorman, Linus Torvalds, Andrew Morton, Kirill A. Shutemov, Johannes Weiner, Jan Kara, Rik van Riel, linux-mm

On Thu, 29 Sep 2016 15:16:35 +0200 Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, Sep 29, 2016 at 10:55:44PM +1000, Nicholas Piggin wrote:
> > I know it's not fashionable to care about them anymore, but it's sad if
> > 32-bit architectures miss out fundamental optimisations like this because
> > we're out of page flags.
>
> I'm all for cleaning some of that up, but its been a long while since I
> poked in that general area.

Forgive my rant! Cleaning page flags is a topic of its own...
* Re: page_waitqueue() considered harmful 2016-09-29 12:55 ` Nicholas Piggin 2016-09-29 13:16 ` Peter Zijlstra @ 2016-09-29 15:05 ` Rik van Riel 1 sibling, 0 replies; 42+ messages in thread From: Rik van Riel @ 2016-09-29 15:05 UTC (permalink / raw) To: Nicholas Piggin, Peter Zijlstra Cc: Mel Gorman, Linus Torvalds, Andrew Morton, Kirill A. Shutemov, Johannes Weiner, Jan Kara, linux-mm On Thu, 2016-09-29 at 22:55 +1000, Nicholas Piggin wrote: > PG_swapcache - can this be replaced with one of the private bits, I > wonder? Perhaps. page->mapping needs to be free to point at the anon_vma, but from the mapping pointer we can see that the page is swap backed. Is there any use for page->private for swap backed pages that is not the page cache index? If so, (PageAnon(page) && page->private) might work as a replacement for PG_swapcache. That might catch some false positives with the special swap types used for migration, but maybe we do not need to care about those (much), or we can filter them out with a more in-depth check? -- All rights reversed
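As a rough user-space sketch of the check Rik floats above: a page counts as "in swap cache" when it is anonymous (low bit of ->mapping set, in the style of the kernel's PAGE_MAPPING_ANON convention) and ->private holds a non-zero swap entry. The struct and helper names are invented for this sketch, not kernel API, and it ignores the migration-entry false positives he mentions.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Toy model of the proposed PG_swapcache replacement: a page is in
 * the swap cache iff it is anonymous (low bit of ->mapping set, like
 * the kernel's PAGE_MAPPING_ANON) and ->private holds a swap entry.
 * "mpage" and friends are invented names for this sketch.
 */
struct mpage {
	uintptr_t mapping;	/* anon_vma pointer, low bit marks anon */
	unsigned long private;	/* swap entry when swap backed, else 0 */
};

#define MPAGE_MAPPING_ANON 0x1UL

static int mpage_anon(const struct mpage *p)
{
	return (p->mapping & MPAGE_MAPPING_ANON) != 0;
}

/* The (PageAnon(page) && page->private) test from Rik's mail, modelled. */
static int mpage_swapcache(const struct mpage *p)
{
	return mpage_anon(p) && p->private != 0;
}
```

The sketch shows why the check is only almost right: anything anonymous that stashes a value in ->private would read as swapcache, which is exactly the false-positive class Rik flags.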
* Re: page_waitqueue() considered harmful 2016-09-26 20:58 page_waitqueue() considered harmful Linus Torvalds 2016-09-26 21:23 ` Rik van Riel 2016-09-27 7:30 ` Peter Zijlstra @ 2016-09-27 8:03 ` Jan Kara 2016-09-27 8:31 ` Mel Gorman 3 siblings, 0 replies; 42+ messages in thread From: Jan Kara @ 2016-09-27 8:03 UTC (permalink / raw) To: Linus Torvalds Cc: Andrew Morton, Kirill A. Shutemov, Johannes Weiner, Jan Kara, Mel Gorman, Peter Zijlstra, Rik van Riel, linux-mm Hello, On Mon 26-09-16 13:58:00, Linus Torvalds wrote: > #7 is in the kernel again. And that one struck me as really odd. It's > "unlock_page()", while #9 is __wake_up_bit(). WTF? There's no IO in > this load, it's all cached, why do we use 3% of the time (1.7% and > 1.4% respectively) on unlocking a page. And why can't I see the > locking part? > > It turns out that I *can* see the locking part, but it's pretty cheap. > It's inside of filemap_map_pages(), which does a trylock, and it shows > up as about 1/6th of the cost of that function. Still, it's much less > than the unlocking side. Why is unlocking so expensive? > > Yeah, unlocking is expensive because of the nasty __wake_up_bit() > code. In fact, even inside "unlock_page()" itself, most of the costs > aren't even the atomic bit clearing (like you'd expect), it's the > inlined part of wake_up_bit(). Which does some really nasty crud. > > Why is the page_waitqueue() handling so expensive? Let me count the ways: > > (a) stupid code generation interacting really badly with microarchitecture > > We clear the bit in page->flags with a byte-sized "lock andb", > and then page_waitqueue() looks up the zone by reading the full value > of "page->flags" immediately afterwards, which causes bad bad behavior > with the whole read-after-partial-write thing. Nasty. > > (b) It's cache miss heaven. 
It takes a cache miss on three different > things:looking up the zone 'wait_table', then looking up the hash > queue there, and finally (inside __wake_up_bit) looking up the wait > queue itself (which will effectively always be NULL). > > Now, (a) could be fairly easy to fix several ways (the byte-size > operation on an "unsigned long" field have caused problems before, and > may just be a mistake), but (a) wouldn't even be a problem if we > didn't have the complexity of (b) and having to look up the zone and > everything. So realistically, it's actually (b) that is the primary > problem, and indirectly causes (a) to happen too. > > Is there really any reason for that incredible indirection? Do we > really want to make the page_waitqueue() be a per-zone thing at all? > Especially since all those wait-queues won't even be *used* unless > there is actual IO going on and people are really getting into > contention on the page lock.. Why isn't the page_waitqueue() just one > statically sized array? > > Also, if those bitlock ops had a different bit that showed contention, > we could actually skip *all* of this, and just see that "oh, nobody is > waiting on this page anyway, so there's no point in looking up those > wait queues". We don't have that many "__wait_on_bit()" users, maybe > we could say that the bitlocks do have to haev *two* bits: one for the > lock bit itself, and one for "there is contention". So we were carrying patches in SUSE kernels for a long time that did something like this (since 2011 it seems) since the unlock_page() was pretty visible on some crazy SGI workload on their supercomputer. But they were never accepted upstream and eventually we mostly dropped them (some less intrusive optimizations got merged) when rebasing on 3.12 - Mel did the evaluation at that time so he should be able to fill in details but at one place he writes: "These patches lack a compelling use case for pushing to mainline. 
They show a measurable gain in profiles but the gain is in the noise for the whole workload. It is known to improve boot times on very large machines and help an artificial test case but that is not a compelling reason to consume a page flag and push it to mainline. The patches are held in reserve until a compelling case for them is found." Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR
* Re: page_waitqueue() considered harmful 2016-09-26 20:58 page_waitqueue() considered harmful Linus Torvalds ` (2 preceding siblings ...) 2016-09-27 8:03 ` Jan Kara @ 2016-09-27 8:31 ` Mel Gorman 2016-09-27 14:34 ` Peter Zijlstra 2016-09-27 14:53 ` Nicholas Piggin 3 siblings, 2 replies; 42+ messages in thread From: Mel Gorman @ 2016-09-27 8:31 UTC (permalink / raw) To: Linus Torvalds Cc: Andrew Morton, Kirill A. Shutemov, Johannes Weiner, Jan Kara, Peter Zijlstra, Rik van Riel, linux-mm Hi Linus, On Mon, Sep 26, 2016 at 01:58:00PM -0700, Linus Torvalds wrote: > So I've been doing some profiling of the git "make test" load, which > is interesting because it's just a lot of small scripts, and it shows > our fork/execve/exit costs. > > Some of the top offenders are pretty understandable: we have long had > unmap_page_range() show up at the top on these loads, because the > zap_pte_range() function ends up touching each page as it unmaps > things (it does it to check whether it's an anonymous page, but then > also for the page map count update etc), and that's a disaster from a > cache standpoint. That single function is something between 3-4% of > CPU time, and the one instruction that accesses "struct page" the > first time is a large portion of that. Yes, a single instruction is > blamed for about 1% of all CPU time on a fork/exec/exit workload. > It was found at one point that the fault-around made these costs worse as there were simply more pages to tear down. However, this only applied to fork/exit microbenchmarks. Matt Fleming prototyped an unreleased patch that tried to be clever about this but the cost was never worthwhile. A plain revert helped a microbenchmark but hurt workloads like the git testsuite which was shell intensive. It got filed under "we're not fixing a fork/exit microbenchmark at the expense of "real" workloads like git checkout and git testsuite". 
> <SNIP> > > #5 and #6 on my profile are user space (_int_malloc in glibc, and > do_lookup_x in the loader - I think user space should probably start > thinking more about doing static libraries for the really core basic > things, but whatever. Not a kernel issue. > Recent problems have been fixed with _int_malloc in glibc, particularly as it applies to threads but no fix springs to mind that might impact "make test". > #7 is in the kernel again. And that one struck me as really odd. It's > "unlock_page()", while #9 is __wake_up_bit(). WTF? There's no IO in > this load, it's all cached, why do we use 3% of the time (1.7% and > 1.4% respectively) on unlocking a page. And why can't I see the > locking part? > > It turns out that I *can* see the locking part, but it's pretty cheap. > It's inside of filemap_map_pages(), which does a trylock, and it shows > up as about 1/6th of the cost of that function. Still, it's much less > than the unlocking side. Why is unlocking so expensive? > > Yeah, unlocking is expensive because of the nasty __wake_up_bit() > code. In fact, even inside "unlock_page()" itself, most of the costs > aren't even the atomic bit clearing (like you'd expect), it's the > inlined part of wake_up_bit(). Which does some really nasty crud. > > Why is the page_waitqueue() handling so expensive? Let me count the ways: > page_waitqueue() has been a hazard for years. I think the last attempt to fix it was back in 2014 http://www.spinics.net/lists/linux-mm/msg73207.html The patch is heavily derived from work by Nick Piggin who noticed the years before that. I think that was the last version I posted and the changelog includes profile data. I don't have an exact reference why it was rejected but a consistent piece of feedback was that it was very complex for the level of impact it had. 
> (a) stupid code generation interacting really badly with microarchitecture > > We clear the bit in page->flags with a byte-sized "lock andb", > and then page_waitqueue() looks up the zone by reading the full value > of "page->flags" immediately afterwards, which causes bad bad behavior > with the whole read-after-partial-write thing. Nasty. > Other than some code churn, it should be possible to lookup the waitqueue before clearing the bit if the patch that side-steps the lookup entirely is still distasteful. > (b) It's cache miss heaven. It takes a cache miss on three different > things:looking up the zone 'wait_table', then looking up the hash > queue there, and finally (inside __wake_up_bit) looking up the wait > queue itself (which will effectively always be NULL). > > Now, (a) could be fairly easy to fix several ways (the byte-size > operation on an "unsigned long" field have caused problems before, and > may just be a mistake), but (a) wouldn't even be a problem if we > didn't have the complexity of (b) and having to look up the zone and > everything. So realistically, it's actually (b) that is the primary > problem, and indirectly causes (a) to happen too. > The patch in question side-steps the (b) part. > Is there really any reason for that incredible indirection? Do we > really want to make the page_waitqueue() be a per-zone thing at all? > Especially since all those wait-queues won't even be *used* unless > there is actual IO going on and people are really getting into > contention on the page lock.. Why isn't the page_waitqueue() just one > statically sized array? > About all I can think of is NUMA locality. Doing it per-zone at the time would be convenient. It could be per-node but that's no better from a complexity point of view. If it was a static array, the key could be related to the node the page is on but that's still looking into the page flags. 
> Also, if those bitlock ops had a different bit that showed contention, > we could actually skip *all* of this, and just see that "oh, nobody is > waiting on this page anyway, so there's no point in looking up those > wait queues". That's very close to what the 2014 patch does. However, it's 64-bit only due to the use of page flags. Even though it's out of date, can you take a look? If it doesn't trigger hate, I can forward port it unless you do that yourself for the purposes of testing. -- Mel Gorman SUSE Labs
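Linus's "one statically sized array" question amounts to hashing the page address into a fixed table of wait queue heads, with no page->flags walk at all. A minimal user-space sketch of that shape, with a made-up table size and stand-in types (the multiplier is the common 64-bit golden-ratio hash constant):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Sketch of a flat page_waitqueue(): hash the page address into a
 * fixed table of wait queue heads, touching nothing in page->flags.
 * Table size, names and the stand-in waitq type are invented for
 * illustration.
 */
#define WAIT_TABLE_BITS 8
#define WAIT_TABLE_SIZE (1UL << WAIT_TABLE_BITS)

struct waitq { int nr_waiters; };	/* stand-in for wait_queue_head_t */

static struct waitq page_wait_table[WAIT_TABLE_SIZE];

static struct waitq *page_waitqueue_static(const void *page)
{
	uint64_t h = (uint64_t)(uintptr_t)page * 0x9E3779B97F4A7C15ULL;

	/* the top bits of the product are the best-mixed index */
	return &page_wait_table[h >> (64 - WAIT_TABLE_BITS)];
}
```

Mel's NUMA-locality point is the main argument against a flat table like this: all CPUs then hash into one shared array rather than zone-local ones.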
* Re: page_waitqueue() considered harmful 2016-09-27 8:31 ` Mel Gorman @ 2016-09-27 14:34 ` Peter Zijlstra 2016-09-27 15:08 ` Nicholas Piggin ` (2 more replies) 2016-09-27 14:53 ` Nicholas Piggin 1 sibling, 3 replies; 42+ messages in thread From: Peter Zijlstra @ 2016-09-27 14:34 UTC (permalink / raw) To: Mel Gorman Cc: Linus Torvalds, Andrew Morton, Kirill A. Shutemov, Johannes Weiner, Jan Kara, Rik van Riel, linux-mm On Tue, Sep 27, 2016 at 09:31:04AM +0100, Mel Gorman wrote: > page_waitqueue() has been a hazard for years. I think the last attempt to > fix it was back in 2014 http://www.spinics.net/lists/linux-mm/msg73207.html > > The patch is heavily derived from work by Nick Piggin who noticed the years > before that. I think that was the last version I posted and the changelog > includes profile data. I don't have an exact reference why it was rejected > but a consistent piece of feedback was that it was very complex for the > level of impact it had. Right, I never really liked that patch. In any case, the below seems to boot, although the lock_page_wait() thing did get my brain in a bit of a twist. Doing explicit loops with PG_contended inside wq->lock would be more obvious but results in much more code. We could muck about with PG_contended naming/placement if any of this shows benefit. It does boot on my x86_64 and builds a kernel, so it must be perfect ;-) --- include/linux/page-flags.h | 2 ++ include/linux/pagemap.h | 21 +++++++++---- include/linux/wait.h | 2 +- include/trace/events/mmflags.h | 1 + kernel/sched/wait.c | 17 ++++++---- mm/filemap.c | 71 ++++++++++++++++++++++++++++++++++++------ 6 files changed, 92 insertions(+), 22 deletions(-) diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index 74e4dda..0ed3900 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -73,6 +73,7 @@ */ enum pageflags { PG_locked, /* Page is locked. Don't touch. */ + PG_contended, /* Page lock is contended. 
*/ PG_error, PG_referenced, PG_uptodate, @@ -253,6 +254,7 @@ static inline int TestClearPage##uname(struct page *page) { return 0; } TESTSETFLAG_FALSE(uname) TESTCLEARFLAG_FALSE(uname) __PAGEFLAG(Locked, locked, PF_NO_TAIL) +PAGEFLAG(Contended, contended, PF_NO_TAIL) PAGEFLAG(Error, error, PF_NO_COMPOUND) TESTCLEARFLAG(Error, error, PF_NO_COMPOUND) PAGEFLAG(Referenced, referenced, PF_HEAD) TESTCLEARFLAG(Referenced, referenced, PF_HEAD) diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h index 66a1260..3b38a96 100644 --- a/include/linux/pagemap.h +++ b/include/linux/pagemap.h @@ -417,7 +417,7 @@ extern void __lock_page(struct page *page); extern int __lock_page_killable(struct page *page); extern int __lock_page_or_retry(struct page *page, struct mm_struct *mm, unsigned int flags); -extern void unlock_page(struct page *page); +extern void __unlock_page(struct page *page); static inline int trylock_page(struct page *page) { @@ -448,6 +448,16 @@ static inline int lock_page_killable(struct page *page) return 0; } +static inline void unlock_page(struct page *page) +{ + page = compound_head(page); + VM_BUG_ON_PAGE(!PageLocked(page), page); + clear_bit_unlock(PG_locked, &page->flags); + smp_mb__after_atomic(); + if (PageContended(page)) + __unlock_page(page); +} + /* * lock_page_or_retry - Lock the page, unless this would block and the * caller indicated that it can handle a retry. 
@@ -472,11 +482,11 @@ extern int wait_on_page_bit_killable(struct page *page, int bit_nr); extern int wait_on_page_bit_killable_timeout(struct page *page, int bit_nr, unsigned long timeout); +extern int wait_on_page_lock(struct page *page, int mode); + static inline int wait_on_page_locked_killable(struct page *page) { - if (!PageLocked(page)) - return 0; - return wait_on_page_bit_killable(compound_head(page), PG_locked); + return wait_on_page_lock(page, TASK_KILLABLE); } extern wait_queue_head_t *page_waitqueue(struct page *page); @@ -494,8 +504,7 @@ static inline void wake_up_page(struct page *page, int bit) */ static inline void wait_on_page_locked(struct page *page) { - if (PageLocked(page)) - wait_on_page_bit(compound_head(page), PG_locked); + wait_on_page_lock(page, TASK_UNINTERRUPTIBLE); } /* diff --git a/include/linux/wait.h b/include/linux/wait.h index c3ff74d..524cd54 100644 --- a/include/linux/wait.h +++ b/include/linux/wait.h @@ -198,7 +198,7 @@ __remove_wait_queue(wait_queue_head_t *head, wait_queue_t *old) typedef int wait_bit_action_f(struct wait_bit_key *, int mode); void __wake_up(wait_queue_head_t *q, unsigned int mode, int nr, void *key); -void __wake_up_locked_key(wait_queue_head_t *q, unsigned int mode, void *key); +int __wake_up_locked_key(wait_queue_head_t *q, unsigned int mode, void *key); void __wake_up_sync_key(wait_queue_head_t *q, unsigned int mode, int nr, void *key); void __wake_up_locked(wait_queue_head_t *q, unsigned int mode, int nr); void __wake_up_sync(wait_queue_head_t *q, unsigned int mode, int nr); diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h index 5a81ab4..18b8398 100644 --- a/include/trace/events/mmflags.h +++ b/include/trace/events/mmflags.h @@ -81,6 +81,7 @@ #define __def_pageflag_names \ {1UL << PG_locked, "locked" }, \ + {1UL << PG_contended, "contended" }, \ {1UL << PG_error, "error" }, \ {1UL << PG_referenced, "referenced" }, \ {1UL << PG_uptodate, "uptodate" }, \ diff --git 
a/kernel/sched/wait.c b/kernel/sched/wait.c index f15d6b6..46dcc42 100644 --- a/kernel/sched/wait.c +++ b/kernel/sched/wait.c @@ -62,18 +62,23 @@ EXPORT_SYMBOL(remove_wait_queue); * started to run but is not in state TASK_RUNNING. try_to_wake_up() returns * zero in this (rare) case, and we handle it by continuing to scan the queue. */ -static void __wake_up_common(wait_queue_head_t *q, unsigned int mode, +static int __wake_up_common(wait_queue_head_t *q, unsigned int mode, int nr_exclusive, int wake_flags, void *key) { wait_queue_t *curr, *next; + int woken = 0; list_for_each_entry_safe(curr, next, &q->task_list, task_list) { unsigned flags = curr->flags; - if (curr->func(curr, mode, wake_flags, key) && - (flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive) - break; + if (curr->func(curr, mode, wake_flags, key)) { + woken++; + if ((flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive) + break; + } } + + return woken; } /** @@ -106,9 +111,9 @@ void __wake_up_locked(wait_queue_head_t *q, unsigned int mode, int nr) } EXPORT_SYMBOL_GPL(__wake_up_locked); -void __wake_up_locked_key(wait_queue_head_t *q, unsigned int mode, void *key) +int __wake_up_locked_key(wait_queue_head_t *q, unsigned int mode, void *key) { - __wake_up_common(q, mode, 1, 0, key); + return __wake_up_common(q, mode, 1, 0, key); } EXPORT_SYMBOL_GPL(__wake_up_locked_key); diff --git a/mm/filemap.c b/mm/filemap.c index 8a287df..d3e3203 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -847,15 +847,18 @@ EXPORT_SYMBOL_GPL(add_page_wait_queue); * The mb is necessary to enforce ordering between the clear_bit and the read * of the waitqueue (to avoid SMP races with a parallel wait_on_page_locked()). 
*/ -void unlock_page(struct page *page) +void __unlock_page(struct page *page) { - page = compound_head(page); - VM_BUG_ON_PAGE(!PageLocked(page), page); - clear_bit_unlock(PG_locked, &page->flags); - smp_mb__after_atomic(); - wake_up_page(page, PG_locked); + struct wait_bit_key key = __WAIT_BIT_KEY_INITIALIZER(&page->flags, PG_locked); + wait_queue_head_t *wq = page_waitqueue(page); + unsigned long flags; + + spin_lock_irqsave(&wq->lock, flags); + if (!__wake_up_locked_key(wq, TASK_NORMAL, &key)) + ClearPageContended(page); + spin_unlock_irqrestore(&wq->lock, flags); } -EXPORT_SYMBOL(unlock_page); +EXPORT_SYMBOL(__unlock_page); /** * end_page_writeback - end writeback against a page @@ -908,6 +911,44 @@ void page_endio(struct page *page, bool is_write, int err) } EXPORT_SYMBOL_GPL(page_endio); +static int lock_page_wait(struct wait_bit_key *word, int mode) +{ + struct page *page = container_of(word->flags, struct page, flags); + + /* + * Set PG_contended now that we're enqueued on the waitqueue, this + * way we cannot race against ClearPageContended in wakeup. + */ + if (!PageContended(page)) { + SetPageContended(page); + + /* + * If we set PG_contended, we must order against any later list + * addition and/or a later PG_lock load. + * + * [unlock] [wait] + * + * clear PG_locked set PG_contended + * test PG_contended test-and-set PG_locked + * + * and + * + * [unlock] [wait] + * + * test PG_contended set PG_contended + * wakeup add + * clear PG_contended sleep + * + * In either case we must not reorder nor sleep before + * PG_contended is visible. 
+ */ + smp_mb__after_atomic(); + return 0; + } + + return bit_wait_io(word, mode); +} + /** * __lock_page - get a lock on the page, assuming we need to sleep to get it * @page: the page to lock @@ -917,7 +958,7 @@ void __lock_page(struct page *page) struct page *page_head = compound_head(page); DEFINE_WAIT_BIT(wait, &page_head->flags, PG_locked); - __wait_on_bit_lock(page_waitqueue(page_head), &wait, bit_wait_io, + __wait_on_bit_lock(page_waitqueue(page_head), &wait, lock_page_wait, TASK_UNINTERRUPTIBLE); } EXPORT_SYMBOL(__lock_page); @@ -928,10 +969,22 @@ int __lock_page_killable(struct page *page) DEFINE_WAIT_BIT(wait, &page_head->flags, PG_locked); return __wait_on_bit_lock(page_waitqueue(page_head), &wait, - bit_wait_io, TASK_KILLABLE); + lock_page_wait, TASK_KILLABLE); } EXPORT_SYMBOL_GPL(__lock_page_killable); +int wait_on_page_lock(struct page *page, int mode) +{ + struct page __always_unused *__page = (page = compound_head(page)); + DEFINE_WAIT_BIT(wait, &page->flags, PG_locked); + + if (!PageLocked(page)) + return 0; + + return __wait_on_bit(page_waitqueue(page), &wait, lock_page_wait, mode); +} +EXPORT_SYMBOL(wait_on_page_lock); + /* * Return values: * 1 - page is locked; mmap_sem is still held.
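The heart of the patch is that unlock_page() only pays for the page_waitqueue() lookup when the new contended bit is set, and the slow path clears that bit lazily when the wakeup finds nobody to wake. A single-threaded toy model of that control flow (plain ints instead of atomic page flags, so it sidesteps exactly the races the patch's comment block is about; all names invented):

```c
#include <assert.h>

/*
 * Single-threaded toy of the patch's fast path: unlock only walks to
 * the waitqueue when the contended bit is set, and the bit is cleared
 * lazily when the wakeup wakes nobody.  No atomics or barriers here;
 * this only shows the control flow.
 */
struct lpage {
	int locked;	/* PG_locked */
	int contended;	/* PG_contended */
};

static int slow_wakeups;	/* times we had to look up the waitqueue */
static int pending_waiters;	/* what __wake_up_locked_key() would find */

static void slow_unlock(struct lpage *p)
{
	slow_wakeups++;			/* page_waitqueue() + wq->lock here */
	if (pending_waiters == 0)	/* woke nobody: clear the bit */
		p->contended = 0;
}

static void unlock_lpage(struct lpage *p)
{
	p->locked = 0;		/* clear_bit_unlock(PG_locked, ...) */
	if (p->contended)	/* the common, uncontended case skips this */
		slow_unlock(p);
}
```

In the real patch both the wakeup and the clear run under wq->lock, and the smp_mb__after_atomic() plus the barrier in lock_page_wait() are what keep the fast-path PageContended() test from racing the waiter's enqueue.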
* Re: page_waitqueue() considered harmful 2016-09-27 14:34 ` Peter Zijlstra @ 2016-09-27 15:08 ` Nicholas Piggin 2016-09-27 16:31 ` Linus Torvalds 2016-09-28 10:45 ` Mel Gorman 2 siblings, 0 replies; 42+ messages in thread From: Nicholas Piggin @ 2016-09-27 15:08 UTC (permalink / raw) To: Peter Zijlstra Cc: Mel Gorman, Linus Torvalds, Andrew Morton, Kirill A. Shutemov, Johannes Weiner, Jan Kara, Rik van Riel, linux-mm On Tue, 27 Sep 2016 16:34:26 +0200 Peter Zijlstra <peterz@infradead.org> wrote: > On Tue, Sep 27, 2016 at 09:31:04AM +0100, Mel Gorman wrote: > > page_waitqueue() has been a hazard for years. I think the last attempt to > > fix it was back in 2014 http://www.spinics.net/lists/linux-mm/msg73207.html > > > > The patch is heavily derived from work by Nick Piggin who noticed the years > > before that. I think that was the last version I posted and the changelog > > includes profile data. I don't have an exact reference why it was rejected > > but a consistent piece of feedback was that it was very complex for the > > level of impact it had. > > Right, I never really liked that patch. In any case, the below seems to > boot, although the lock_page_wait() thing did get my brain in a bit of a > twist. Doing explicit loops with PG_contended inside wq->lock would be > more obvious but results in much more code. > > We could muck about with PG_contended naming/placement if any of this > shows benefit. 
> > It does boot on my x86_64 and builds a kernel, so it must be perfect ;-) > > --- > include/linux/page-flags.h | 2 ++ > include/linux/pagemap.h | 21 +++++++++---- > include/linux/wait.h | 2 +- > include/trace/events/mmflags.h | 1 + > kernel/sched/wait.c | 17 ++++++---- > mm/filemap.c | 71 ++++++++++++++++++++++++++++++++++++------ > 6 files changed, 92 insertions(+), 22 deletions(-) > > diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h > index 74e4dda..0ed3900 100644 > --- a/include/linux/page-flags.h > +++ b/include/linux/page-flags.h > @@ -73,6 +73,7 @@ > */ > enum pageflags { > PG_locked, /* Page is locked. Don't touch. */ > + PG_contended, /* Page lock is contended. */ > PG_error, > PG_referenced, > PG_uptodate, > @@ -253,6 +254,7 @@ static inline int TestClearPage##uname(struct page *page) { return 0; } > TESTSETFLAG_FALSE(uname) TESTCLEARFLAG_FALSE(uname) > > __PAGEFLAG(Locked, locked, PF_NO_TAIL) > +PAGEFLAG(Contended, contended, PF_NO_TAIL) > PAGEFLAG(Error, error, PF_NO_COMPOUND) TESTCLEARFLAG(Error, error, PF_NO_COMPOUND) > PAGEFLAG(Referenced, referenced, PF_HEAD) > TESTCLEARFLAG(Referenced, referenced, PF_HEAD) > diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h > index 66a1260..3b38a96 100644 > --- a/include/linux/pagemap.h > +++ b/include/linux/pagemap.h > @@ -417,7 +417,7 @@ extern void __lock_page(struct page *page); > extern int __lock_page_killable(struct page *page); > extern int __lock_page_or_retry(struct page *page, struct mm_struct *mm, > unsigned int flags); > -extern void unlock_page(struct page *page); > +extern void __unlock_page(struct page *page); > > static inline int trylock_page(struct page *page) > { > @@ -448,6 +448,16 @@ static inline int lock_page_killable(struct page *page) > return 0; > } > > +static inline void unlock_page(struct page *page) > +{ > + page = compound_head(page); > + VM_BUG_ON_PAGE(!PageLocked(page), page); > + clear_bit_unlock(PG_locked, &page->flags); > + 
smp_mb__after_atomic(); If the wakeup does not do a reorderable waitqueue_active check outside the lock, why is this barrier needed? > + if (PageContended(page)) > + __unlock_page(page); > +} > + > @@ -62,18 +62,23 @@ EXPORT_SYMBOL(remove_wait_queue); > * started to run but is not in state TASK_RUNNING. try_to_wake_up() returns > * zero in this (rare) case, and we handle it by continuing to scan the queue. > */ > -static void __wake_up_common(wait_queue_head_t *q, unsigned int mode, > +static int __wake_up_common(wait_queue_head_t *q, unsigned int mode, > int nr_exclusive, int wake_flags, void *key) > { > wait_queue_t *curr, *next; > + int woken = 0; > > list_for_each_entry_safe(curr, next, &q->task_list, task_list) { > unsigned flags = curr->flags; > > - if (curr->func(curr, mode, wake_flags, key) && > - (flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive) > - break; > + if (curr->func(curr, mode, wake_flags, key)) { > + woken++; > + if ((flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive) > + break; > + } > } > + > + return woken; > } > > /** > @@ -106,9 +111,9 @@ void __wake_up_locked(wait_queue_head_t *q, unsigned int mode, int nr) > } > EXPORT_SYMBOL_GPL(__wake_up_locked); > > -void __wake_up_locked_key(wait_queue_head_t *q, unsigned int mode, void *key) > +int __wake_up_locked_key(wait_queue_head_t *q, unsigned int mode, void *key) > { > - __wake_up_common(q, mode, 1, 0, key); > + return __wake_up_common(q, mode, 1, 0, key); > } > EXPORT_SYMBOL_GPL(__wake_up_locked_key); > > diff --git a/mm/filemap.c b/mm/filemap.c > index 8a287df..d3e3203 100644 > --- a/mm/filemap.c > +++ b/mm/filemap.c > @@ -847,15 +847,18 @@ EXPORT_SYMBOL_GPL(add_page_wait_queue); > * The mb is necessary to enforce ordering between the clear_bit and the read > * of the waitqueue (to avoid SMP races with a parallel wait_on_page_locked()). 
> */ > -void unlock_page(struct page *page) > +void __unlock_page(struct page *page) > { > - page = compound_head(page); > - VM_BUG_ON_PAGE(!PageLocked(page), page); > - clear_bit_unlock(PG_locked, &page->flags); > - smp_mb__after_atomic(); > - wake_up_page(page, PG_locked); > + struct wait_bit_key key = __WAIT_BIT_KEY_INITIALIZER(&page->flags, PG_locked); > + wait_queue_head_t *wq = page_waitqueue(page); > + unsigned long flags; > + > + spin_lock_irqsave(&wq->lock, flags); > + if (!__wake_up_locked_key(wq, TASK_NORMAL, &key)) > + ClearPageContended(page); > + spin_unlock_irqrestore(&wq->lock, flags); This approach of just counting wakeups doesn't work when you use the waiter bit for other than just PG_locked (e.g., PG_writeback). I guess that's why I didn't call it contented too: it really should be just about whether it is on the waitqueue or not. > } > -EXPORT_SYMBOL(unlock_page); > +EXPORT_SYMBOL(__unlock_page); > > /** > * end_page_writeback - end writeback against a page > @@ -908,6 +911,44 @@ void page_endio(struct page *page, bool is_write, int err) > } > EXPORT_SYMBOL_GPL(page_endio); > > +static int lock_page_wait(struct wait_bit_key *word, int mode) > +{ > + struct page *page = container_of(word->flags, struct page, flags); > + > + /* > + * Set PG_contended now that we're enqueued on the waitqueue, this > + * way we cannot race against ClearPageContended in wakeup. > + */ > + if (!PageContended(page)) { > + SetPageContended(page); > + > + /* > + * If we set PG_contended, we must order against any later list > + * addition and/or a later PG_lock load. > + * > + * [unlock] [wait] > + * > + * clear PG_locked set PG_contended > + * test PG_contended test-and-set PG_locked > + * > + * and > + * > + * [unlock] [wait] > + * > + * test PG_contended set PG_contended > + * wakeup add > + * clear PG_contended sleep > + * > + * In either case we must not reorder nor sleep before > + * PG_contended is visible. 
> + */ > + smp_mb__after_atomic(); > + return 0; > + } > + > + return bit_wait_io(word, mode); > +} I probably slightly prefer how my patch maintains the bit, but I'm not really fussed how it works exactly as long as it can support waiting on other bits as well. The more interesting question is the memory barriers -- see my other post. Thanks, Nick
* Re: page_waitqueue() considered harmful 2016-09-27 14:34 ` Peter Zijlstra 2016-09-27 15:08 ` Nicholas Piggin @ 2016-09-27 16:31 ` Linus Torvalds 2016-09-27 16:49 ` Peter Zijlstra 2016-09-28 10:45 ` Mel Gorman 2 siblings, 1 reply; 42+ messages in thread From: Linus Torvalds @ 2016-09-27 16:31 UTC (permalink / raw) To: Peter Zijlstra Cc: Mel Gorman, Andrew Morton, Kirill A. Shutemov, Johannes Weiner, Jan Kara, Rik van Riel, linux-mm On Tue, Sep 27, 2016 at 7:34 AM, Peter Zijlstra <peterz@infradead.org> wrote: > > Right, I never really liked that patch. In any case, the below seems to > boot, although the lock_page_wait() thing did get my brain in a bit of a > twist. Doing explicit loops with PG_contended inside wq->lock would be > more obvious but results in much more code. > > We could muck about with PG_contended naming/placement if any of this > shows benefit. > > It does boot on my x86_64 and builds a kernel, so it must be perfect ;-) This patch looks very much like what I was thinking of. Except you made that bit clearing more complicated than I would have done. I see why you did it (it's hard to clear the bit when the wait-queue that is associated with it can be associated with multiple pages), but I think it would be perfectly fine to just not even try to make the "contended" bit be an exact bit. You can literally leave it set (giving us the existing behavior), but then when you hit the __unlock_page() case, and you look up the page_waitqueue(), and find that the waitqueue is empty, *then* you clear it. So you'd end up going through the slow path one too many times per page, but considering that right now we *always* go through that slow-path, and the "one too many times" is "two times per IO rather than just once", it really is not a performance issue. I'd rather go for simple and robust. 
I get a bit nervous when I see you being so careful in counting the number of waiters that match the page key - if any of that code ever gets it wrong (because two different pages that shared a waitqueue happen to race at just the right time), and the bit gets cleared too early, you will get some *very* hard-to-debug problems.

So I actually think your patch is a bit too clever. But maybe there's a reason for it that I just don't see. My gut feel is that your patch is good.

.. and hey, it booted and compiled the kernel, so as you say, it must be perfect.

Linus
* Re: page_waitqueue() considered harmful 2016-09-27 16:31 ` Linus Torvalds @ 2016-09-27 16:49 ` Peter Zijlstra 0 siblings, 0 replies; 42+ messages in thread From: Peter Zijlstra @ 2016-09-27 16:49 UTC (permalink / raw) To: Linus Torvalds Cc: Mel Gorman, Andrew Morton, Kirill A. Shutemov, Johannes Weiner, Jan Kara, Rik van Riel, linux-mm

On Tue, Sep 27, 2016 at 09:31:54AM -0700, Linus Torvalds wrote:
> On Tue, Sep 27, 2016 at 7:34 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > Right, I never really liked that patch. In any case, the below seems to
> > boot, although the lock_page_wait() thing did get my brain in a bit of a
> > twist. Doing explicit loops with PG_contended inside wq->lock would be
> > more obvious but results in much more code.
> >
> > We could muck about with PG_contended naming/placement if any of this
> > shows benefit.
> >
> > It does boot on my x86_64 and builds a kernel, so it must be perfect ;-)
>
> This patch looks very much like what I was thinking of. Except you
> made that bit clearing more complicated than I would have done.
>
> I see why you did it (it's hard to clear the bit when the wait-queue
> that is associated with it can be associated with multiple pages), but
> I think it would be perfectly fine to just not even try to make the
> "contended" bit be an exact bit. You can literally leave it set
> (giving us the existing behavior), but then when you hit the
> __unlock_page() case, and you look up the page_waitqueue(), and find
> that the waitqueue is empty, *then* you clear it.
>
> So you'd end up going through the slow path one too many times per
> page, but considering that right now we *always* go through that
> slow-path, and the "one too many times" is "two times per IO rather
> than just once", it really is not a performance issue. I'd rather go
> for simple and robust.

My clear already does the same thing: once we find nobody to wake up, we clear the bit, which means we did the waitqueue lookup once in vain.
But yes, it allows the bit to be cleared while there are still waiters (for other bits) on the waitqueue.

The other benefit of doing what you suggest (and Nick does) is that you can then indeed use the bit for other waitqueue users like PG_writeback. I never really got that part of the patch as it wasn't spelled out, but it does make sense now that I understand the intent. And assuming collisions are rare, that works just fine.

> I get a bit nervous when I see you being so careful in counting the
> number of waiters that match the page key - if any of that code ever
> gets it wrong (because two different pages that shared a waitqueue
> happen to race at just the right time), and the bit gets cleared too
> early, you will get some *very* hard-to-debug problems.

Right, so I don't care about the actual number, just whether it is 0 or not. Maybe I should've returned bool.

But yes, it's a tad more tricky than I'd like, mostly because I was lazy. Doing the SetPageContended under the wq->lock would've made things more obvious.
* Re: page_waitqueue() considered harmful 2016-09-27 14:34 ` Peter Zijlstra 2016-09-27 15:08 ` Nicholas Piggin 2016-09-27 16:31 ` Linus Torvalds @ 2016-09-28 10:45 ` Mel Gorman 2016-09-28 11:11 ` Peter Zijlstra 2 siblings, 1 reply; 42+ messages in thread From: Mel Gorman @ 2016-09-28 10:45 UTC (permalink / raw) To: Peter Zijlstra Cc: Linus Torvalds, Andrew Morton, Kirill A. Shutemov, Johannes Weiner, Jan Kara, Rik van Riel, linux-mm

On Tue, Sep 27, 2016 at 04:34:26PM +0200, Peter Zijlstra wrote:
> On Tue, Sep 27, 2016 at 09:31:04AM +0100, Mel Gorman wrote:
> > page_waitqueue() has been a hazard for years. I think the last attempt to
> > fix it was back in 2014 http://www.spinics.net/lists/linux-mm/msg73207.html
> >
> > The patch is heavily derived from work by Nick Piggin, who noticed the
> > problem years before that. I think that was the last version I posted and
> > the changelog includes profile data. I don't have an exact reference for
> > why it was rejected, but a consistent piece of feedback was that it was
> > very complex for the level of impact it had.
>
> Right, I never really liked that patch. In any case, the below seems to
> boot, although the lock_page_wait() thing did get my brain in a bit of a
> twist. Doing explicit loops with PG_contended inside wq->lock would be
> more obvious but results in much more code.
>
> We could muck about with PG_contended naming/placement if any of this
> shows benefit.
>
> It does boot on my x86_64 and builds a kernel, so it must be perfect ;-)
>

Heh.

tldr: Other than 32-bit vs 64-bit, I could not find anything obviously wrong.
> --- > include/linux/page-flags.h | 2 ++ > include/linux/pagemap.h | 21 +++++++++---- > include/linux/wait.h | 2 +- > include/trace/events/mmflags.h | 1 + > kernel/sched/wait.c | 17 ++++++---- > mm/filemap.c | 71 ++++++++++++++++++++++++++++++++++++------ > 6 files changed, 92 insertions(+), 22 deletions(-) > > diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h > index 74e4dda..0ed3900 100644 > --- a/include/linux/page-flags.h > +++ b/include/linux/page-flags.h > @@ -73,6 +73,7 @@ > */ > enum pageflags { > PG_locked, /* Page is locked. Don't touch. */ > + PG_contended, /* Page lock is contended. */ > PG_error, > PG_referenced, > PG_uptodate, Naming has been covered by Nick. You may run into the same problem with 32-bit and available page flags. I didn't work out the remaining number of flags but did you check 32-bit is ok? If not, you may need to take a similar approach that I did that says "there are always waiters and use the slow path on 32-bit". > @@ -253,6 +254,7 @@ static inline int TestClearPage##uname(struct page *page) { return 0; } > TESTSETFLAG_FALSE(uname) TESTCLEARFLAG_FALSE(uname) > > __PAGEFLAG(Locked, locked, PF_NO_TAIL) > +PAGEFLAG(Contended, contended, PF_NO_TAIL) > PAGEFLAG(Error, error, PF_NO_COMPOUND) TESTCLEARFLAG(Error, error, PF_NO_COMPOUND) > PAGEFLAG(Referenced, referenced, PF_HEAD) > TESTCLEARFLAG(Referenced, referenced, PF_HEAD) > diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h > index 66a1260..3b38a96 100644 > --- a/include/linux/pagemap.h > +++ b/include/linux/pagemap.h > @@ -417,7 +417,7 @@ extern void __lock_page(struct page *page); > extern int __lock_page_killable(struct page *page); > extern int __lock_page_or_retry(struct page *page, struct mm_struct *mm, > unsigned int flags); > -extern void unlock_page(struct page *page); > +extern void __unlock_page(struct page *page); > > static inline int trylock_page(struct page *page) > { > @@ -448,6 +448,16 @@ static inline int lock_page_killable(struct 
page *page) > return 0; > } > > +static inline void unlock_page(struct page *page) > +{ > + page = compound_head(page); > + VM_BUG_ON_PAGE(!PageLocked(page), page); > + clear_bit_unlock(PG_locked, &page->flags); > + smp_mb__after_atomic(); > + if (PageContended(page)) > + __unlock_page(page); > +} > + The main race to be concerned with is PageContended being set after the page is unlocked and missing a wakeup here. While you explain the protocol later, it's worth referencing it here. > /* > * lock_page_or_retry - Lock the page, unless this would block and the > * caller indicated that it can handle a retry. > @@ -472,11 +482,11 @@ extern int wait_on_page_bit_killable(struct page *page, int bit_nr); > extern int wait_on_page_bit_killable_timeout(struct page *page, > int bit_nr, unsigned long timeout); > > +extern int wait_on_page_lock(struct page *page, int mode); > + > static inline int wait_on_page_locked_killable(struct page *page) > { > - if (!PageLocked(page)) > - return 0; > - return wait_on_page_bit_killable(compound_head(page), PG_locked); > + return wait_on_page_lock(page, TASK_KILLABLE); > } > Ok, I raised an eyebrow at compound_head but it's covered in the helper. 
> extern wait_queue_head_t *page_waitqueue(struct page *page); > @@ -494,8 +504,7 @@ static inline void wake_up_page(struct page *page, int bit) > */ > static inline void wait_on_page_locked(struct page *page) > { > - if (PageLocked(page)) > - wait_on_page_bit(compound_head(page), PG_locked); > + wait_on_page_lock(page, TASK_UNINTERRUPTIBLE); > } > > /* > diff --git a/include/linux/wait.h b/include/linux/wait.h > index c3ff74d..524cd54 100644 > --- a/include/linux/wait.h > +++ b/include/linux/wait.h > @@ -198,7 +198,7 @@ __remove_wait_queue(wait_queue_head_t *head, wait_queue_t *old) > > typedef int wait_bit_action_f(struct wait_bit_key *, int mode); > void __wake_up(wait_queue_head_t *q, unsigned int mode, int nr, void *key); > -void __wake_up_locked_key(wait_queue_head_t *q, unsigned int mode, void *key); > +int __wake_up_locked_key(wait_queue_head_t *q, unsigned int mode, void *key); > void __wake_up_sync_key(wait_queue_head_t *q, unsigned int mode, int nr, void *key); > void __wake_up_locked(wait_queue_head_t *q, unsigned int mode, int nr); > void __wake_up_sync(wait_queue_head_t *q, unsigned int mode, int nr); > diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h > index 5a81ab4..18b8398 100644 > --- a/include/trace/events/mmflags.h > +++ b/include/trace/events/mmflags.h > @@ -81,6 +81,7 @@ > > #define __def_pageflag_names \ > {1UL << PG_locked, "locked" }, \ > + {1UL << PG_contended, "contended" }, \ > {1UL << PG_error, "error" }, \ > {1UL << PG_referenced, "referenced" }, \ > {1UL << PG_uptodate, "uptodate" }, \ > diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c > index f15d6b6..46dcc42 100644 > --- a/kernel/sched/wait.c > +++ b/kernel/sched/wait.c > @@ -62,18 +62,23 @@ EXPORT_SYMBOL(remove_wait_queue); > * started to run but is not in state TASK_RUNNING. try_to_wake_up() returns > * zero in this (rare) case, and we handle it by continuing to scan the queue. 
> */ > -static void __wake_up_common(wait_queue_head_t *q, unsigned int mode, > +static int __wake_up_common(wait_queue_head_t *q, unsigned int mode, > int nr_exclusive, int wake_flags, void *key) > { > wait_queue_t *curr, *next; > + int woken = 0; > > list_for_each_entry_safe(curr, next, &q->task_list, task_list) { > unsigned flags = curr->flags; > > - if (curr->func(curr, mode, wake_flags, key) && > - (flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive) > - break; > + if (curr->func(curr, mode, wake_flags, key)) { > + woken++; > + if ((flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive) > + break; > + } > } > + > + return woken; > } > ok. > /** > @@ -106,9 +111,9 @@ void __wake_up_locked(wait_queue_head_t *q, unsigned int mode, int nr) > } > EXPORT_SYMBOL_GPL(__wake_up_locked); > > -void __wake_up_locked_key(wait_queue_head_t *q, unsigned int mode, void *key) > +int __wake_up_locked_key(wait_queue_head_t *q, unsigned int mode, void *key) > { > - __wake_up_common(q, mode, 1, 0, key); > + return __wake_up_common(q, mode, 1, 0, key); > } > EXPORT_SYMBOL_GPL(__wake_up_locked_key); > > diff --git a/mm/filemap.c b/mm/filemap.c > index 8a287df..d3e3203 100644 > --- a/mm/filemap.c > +++ b/mm/filemap.c > @@ -847,15 +847,18 @@ EXPORT_SYMBOL_GPL(add_page_wait_queue); > * The mb is necessary to enforce ordering between the clear_bit and the read > * of the waitqueue (to avoid SMP races with a parallel wait_on_page_locked()). 
> */
> -void unlock_page(struct page *page)
> +void __unlock_page(struct page *page)
> {
> - page = compound_head(page);
> - VM_BUG_ON_PAGE(!PageLocked(page), page);
> - clear_bit_unlock(PG_locked, &page->flags);
> - smp_mb__after_atomic();
> - wake_up_page(page, PG_locked);
> + struct wait_bit_key key = __WAIT_BIT_KEY_INITIALIZER(&page->flags, PG_locked);
> + wait_queue_head_t *wq = page_waitqueue(page);
> + unsigned long flags;
> +
> + spin_lock_irqsave(&wq->lock, flags);
> + if (!__wake_up_locked_key(wq, TASK_NORMAL, &key))
> + ClearPageContended(page);
> + spin_unlock_irqrestore(&wq->lock, flags);
> }
> -EXPORT_SYMBOL(unlock_page);
> +EXPORT_SYMBOL(__unlock_page);

The function name is questionable. It used to unlock the page but now it's handling the wakeup of waiters. That aside, the wq->lock in itself does not protect the PageContended bit, but I couldn't see a case where we lost a wakeup in the following sequence either:

unlock_page			lock_page
				prepare_wait
lock_wq
wake
ClearPageContended
unlock_wq
				SetPageContended

Otherwise the page_waitqueue deferrals to the slowpath look ok.

--
Mel Gorman
SUSE Labs
* Re: page_waitqueue() considered harmful 2016-09-28 10:45 ` Mel Gorman @ 2016-09-28 11:11 ` Peter Zijlstra 2016-09-28 16:10 ` Linus Torvalds 0 siblings, 1 reply; 42+ messages in thread From: Peter Zijlstra @ 2016-09-28 11:11 UTC (permalink / raw) To: Mel Gorman Cc: Linus Torvalds, Andrew Morton, Kirill A. Shutemov, Johannes Weiner, Jan Kara, Rik van Riel, linux-mm On Wed, Sep 28, 2016 at 11:45:00AM +0100, Mel Gorman wrote: > tldr: Other than 32-bit vs 64-bit, I could not find anything obviously wrong. Thanks, meanwhile I redid the patch based on the feedback from Nick and Linus. I know I ignored the 32bit issue, I figured we'd first see if it helps at all before mucking about with that. In any case, I don't mind doing a PG_waiters and also using it for the PG_writeback thing. Just wasn't what I was thinking of when doing that patch. Re the naming of __unlock_page(), its the slow path continuation of unlock_page(). But I'm not too bothered if people want to change the name. So the below boots and builds a kernel and must be equally perfect, then again x86 has no-op smp_mb__{before,after}_atomic() so the lack of barriers will not affect anything much. PowerPC, ARM and MIPS (among others) will want testing. --- include/linux/page-flags.h | 2 ++ include/linux/pagemap.h | 25 +++++++++---- include/trace/events/mmflags.h | 1 + mm/filemap.c | 80 +++++++++++++++++++++++++++++++++++++----- 4 files changed, 93 insertions(+), 15 deletions(-) diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index 74e4dda..0ed3900 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -73,6 +73,7 @@ */ enum pageflags { PG_locked, /* Page is locked. Don't touch. */ + PG_contended, /* Page lock is contended. 
*/ PG_error, PG_referenced, PG_uptodate, @@ -253,6 +254,7 @@ static inline int TestClearPage##uname(struct page *page) { return 0; } TESTSETFLAG_FALSE(uname) TESTCLEARFLAG_FALSE(uname) __PAGEFLAG(Locked, locked, PF_NO_TAIL) +PAGEFLAG(Contended, contended, PF_NO_TAIL) PAGEFLAG(Error, error, PF_NO_COMPOUND) TESTCLEARFLAG(Error, error, PF_NO_COMPOUND) PAGEFLAG(Referenced, referenced, PF_HEAD) TESTCLEARFLAG(Referenced, referenced, PF_HEAD) diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h index 66a1260..c8b8651 100644 --- a/include/linux/pagemap.h +++ b/include/linux/pagemap.h @@ -417,7 +417,7 @@ extern void __lock_page(struct page *page); extern int __lock_page_killable(struct page *page); extern int __lock_page_or_retry(struct page *page, struct mm_struct *mm, unsigned int flags); -extern void unlock_page(struct page *page); +extern void __unlock_page(struct page *page); static inline int trylock_page(struct page *page) { @@ -448,6 +448,20 @@ static inline int lock_page_killable(struct page *page) return 0; } +static inline void unlock_page(struct page *page) +{ + page = compound_head(page); + VM_BUG_ON_PAGE(!PageLocked(page), page); + clear_bit_unlock(PG_locked, &page->flags); + /* + * Since PG_locked and PG_contended are in the same word, Program-Order + * ensures the load of PG_contended must not observe a value earlier + * than our clear_bit() store. See lock_page_wait(). + */ + if (PageContended(page)) + __unlock_page(page); +} + /* * lock_page_or_retry - Lock the page, unless this would block and the * caller indicated that it can handle a retry. 
@@ -472,11 +486,11 @@ extern int wait_on_page_bit_killable(struct page *page, int bit_nr); extern int wait_on_page_bit_killable_timeout(struct page *page, int bit_nr, unsigned long timeout); +extern int wait_on_page_lock(struct page *page, int mode); + static inline int wait_on_page_locked_killable(struct page *page) { - if (!PageLocked(page)) - return 0; - return wait_on_page_bit_killable(compound_head(page), PG_locked); + return wait_on_page_lock(page, TASK_KILLABLE); } extern wait_queue_head_t *page_waitqueue(struct page *page); @@ -494,8 +508,7 @@ static inline void wake_up_page(struct page *page, int bit) */ static inline void wait_on_page_locked(struct page *page) { - if (PageLocked(page)) - wait_on_page_bit(compound_head(page), PG_locked); + wait_on_page_lock(page, TASK_UNINTERRUPTIBLE); } /* diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h index 5a81ab4..18b8398 100644 --- a/include/trace/events/mmflags.h +++ b/include/trace/events/mmflags.h @@ -81,6 +81,7 @@ #define __def_pageflag_names \ {1UL << PG_locked, "locked" }, \ + {1UL << PG_contended, "contended" }, \ {1UL << PG_error, "error" }, \ {1UL << PG_referenced, "referenced" }, \ {1UL << PG_uptodate, "uptodate" }, \ diff --git a/mm/filemap.c b/mm/filemap.c index 8a287df..79aab60 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -847,15 +847,17 @@ EXPORT_SYMBOL_GPL(add_page_wait_queue); * The mb is necessary to enforce ordering between the clear_bit and the read * of the waitqueue (to avoid SMP races with a parallel wait_on_page_locked()). 
*/ -void unlock_page(struct page *page) +void __unlock_page(struct page *page) { - page = compound_head(page); - VM_BUG_ON_PAGE(!PageLocked(page), page); - clear_bit_unlock(PG_locked, &page->flags); - smp_mb__after_atomic(); - wake_up_page(page, PG_locked); + struct wait_bit_key key = __WAIT_BIT_KEY_INITIALIZER(&page->flags, PG_locked); + wait_queue_head_t *wq = page_waitqueue(page); + + if (waitqueue_active(wq)) + __wake_up(wq, TASK_NORMAL, 1, &key); + else + ClearPageContended(page); } -EXPORT_SYMBOL(unlock_page); +EXPORT_SYMBOL(__unlock_page); /** * end_page_writeback - end writeback against a page @@ -908,6 +910,54 @@ void page_endio(struct page *page, bool is_write, int err) } EXPORT_SYMBOL_GPL(page_endio); +static int lock_page_wait(struct wait_bit_key *word, int mode) +{ + struct page *page = container_of(word->flags, struct page, flags); + + /* + * We cannot go sleep without having PG_contended set. This would mean + * nobody would issue a wakeup and we'd be stuck. + */ + if (!PageContended(page)) { + SetPageContended(page); + + /* + * There are two orderings of importance: + * + * 1) + * + * [unlock] [wait] + * + * clear PG_locked set PG_contended + * test PG_contended test (and-set) PG_locked + * + * Since these are on the same word, and the clear/set + * operation are atomic, they are ordered against one another. + * Program-Order further constraints a CPU from speculating the + * later load to not be earlier than the RmW. So this doesn't + * need an explicit barrier. Also see unlock_page(). + * + * 2) + * + * [unlock] [wait] + * + * test PG_contended set PG_contended + * wakeup/remove add + * clear PG_contended trylock + * sleep + * + * In this case however, we must ensure PG_contended is set + * before we add ourselves to the list, such that unlock() must + * either see PG_contended or we see its clear. 
+ */ + smp_mb__after_atomic(); + + return 0; + } + + return bit_wait_io(word, mode); +} + /** * __lock_page - get a lock on the page, assuming we need to sleep to get it * @page: the page to lock @@ -917,7 +967,7 @@ void __lock_page(struct page *page) struct page *page_head = compound_head(page); DEFINE_WAIT_BIT(wait, &page_head->flags, PG_locked); - __wait_on_bit_lock(page_waitqueue(page_head), &wait, bit_wait_io, + __wait_on_bit_lock(page_waitqueue(page_head), &wait, lock_page_wait, TASK_UNINTERRUPTIBLE); } EXPORT_SYMBOL(__lock_page); @@ -928,10 +978,22 @@ int __lock_page_killable(struct page *page) DEFINE_WAIT_BIT(wait, &page_head->flags, PG_locked); return __wait_on_bit_lock(page_waitqueue(page_head), &wait, - bit_wait_io, TASK_KILLABLE); + lock_page_wait, TASK_KILLABLE); } EXPORT_SYMBOL_GPL(__lock_page_killable); +int wait_on_page_lock(struct page *page, int mode) +{ + struct page __always_unused *__page = (page = compound_head(page)); + DEFINE_WAIT_BIT(wait, &page->flags, PG_locked); + + if (!PageLocked(page)) + return 0; + + return __wait_on_bit(page_waitqueue(page), &wait, lock_page_wait, mode); +} +EXPORT_SYMBOL(wait_on_page_lock); + /* * Return values: * 1 - page is locked; mmap_sem is still held. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 42+ messages in thread
* Re: page_waitqueue() considered harmful 2016-09-28 11:11 ` Peter Zijlstra @ 2016-09-28 16:10 ` Linus Torvalds 2016-09-29 13:08 ` Peter Zijlstra 0 siblings, 1 reply; 42+ messages in thread From: Linus Torvalds @ 2016-09-28 16:10 UTC (permalink / raw) To: Peter Zijlstra Cc: Mel Gorman, Andrew Morton, Kirill A. Shutemov, Johannes Weiner, Jan Kara, Rik van Riel, linux-mm

On Wed, Sep 28, 2016 at 4:11 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> -void unlock_page(struct page *page)
> +void __unlock_page(struct page *page)
> {
> + struct wait_bit_key key = __WAIT_BIT_KEY_INITIALIZER(&page->flags, PG_locked);
> + wait_queue_head_t *wq = page_waitqueue(page);
> +
> + if (waitqueue_active(wq))
> + __wake_up(wq, TASK_NORMAL, 1, &key);
> + else
> + ClearPageContended(page);
> }
> +EXPORT_SYMBOL(__unlock_page);

I think the above needs to be protected. Something like

	spin_lock_irqsave(&wq->lock, flags);
	if (waitqueue_active(wq))
		__wake_up_locked(wq, TASK_NORMAL, 1, &key);
	else
		ClearPageContended(page);
	spin_unlock_irqrestore(&wq->lock, flags);

because otherwise a new waiter could come in and add itself to the wait-queue, and then set the bit, and now we clear it (because we didn't see the new waiter).

The *waiter* doesn't need any extra locking, because doing

	add_wait_queue(..);
	SetPageContended(page);

is not racy (the add_wait_queue() will now already guarantee that nobody else clears the bit).

Hmm?

Linus
* Re: page_waitqueue() considered harmful 2016-09-28 16:10 ` Linus Torvalds @ 2016-09-29 13:08 ` Peter Zijlstra 2016-10-03 10:47 ` Mel Gorman 0 siblings, 1 reply; 42+ messages in thread From: Peter Zijlstra @ 2016-09-29 13:08 UTC (permalink / raw) To: Linus Torvalds Cc: Mel Gorman, Andrew Morton, Kirill A. Shutemov, Johannes Weiner, Jan Kara, Rik van Riel, linux-mm On Wed, Sep 28, 2016 at 09:10:29AM -0700, Linus Torvalds wrote: > I think the above needs to be protected. Something like > > spin_lock_irqsave(&q->lock, flags); > if (waitqueue_active(wq)) > __wake_up_locked(wq, TASK_NORMAL, 1, &key); > else > ClearPageContended(page); > spin_unlock_irqrestore(&q->lock, flags); > > because otherwise a new waiter could come in and add itself to the > wait-queue, and then set the bit, and now we clear it (because we > didn't see the new waiter). > > The *waiter* doesn't need any extra locking, because doing > > add_wait_queue(..); > SetPageContended(page); > > is not racy (the add_wait_queue() will now already guarantee that > nobody else clears the bit). > > Hmm? Yes. I got my brain in a complete twist, but you're right, that is indeed required. Here's a new version with hopefully clearer comments. Same caveat about 32bit, naming etc.. --- include/linux/page-flags.h | 2 + include/linux/pagemap.h | 25 ++++++++--- include/trace/events/mmflags.h | 1 + mm/filemap.c | 94 ++++++++++++++++++++++++++++++++++++++---- 4 files changed, 107 insertions(+), 15 deletions(-) diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index 74e4dda..0ed3900 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -73,6 +73,7 @@ */ enum pageflags { PG_locked, /* Page is locked. Don't touch. */ + PG_contended, /* Page lock is contended. 
*/ PG_error, PG_referenced, PG_uptodate, @@ -253,6 +254,7 @@ static inline int TestClearPage##uname(struct page *page) { return 0; } TESTSETFLAG_FALSE(uname) TESTCLEARFLAG_FALSE(uname) __PAGEFLAG(Locked, locked, PF_NO_TAIL) +PAGEFLAG(Contended, contended, PF_NO_TAIL) PAGEFLAG(Error, error, PF_NO_COMPOUND) TESTCLEARFLAG(Error, error, PF_NO_COMPOUND) PAGEFLAG(Referenced, referenced, PF_HEAD) TESTCLEARFLAG(Referenced, referenced, PF_HEAD) diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h index 66a1260..c8b8651 100644 --- a/include/linux/pagemap.h +++ b/include/linux/pagemap.h @@ -417,7 +417,7 @@ extern void __lock_page(struct page *page); extern int __lock_page_killable(struct page *page); extern int __lock_page_or_retry(struct page *page, struct mm_struct *mm, unsigned int flags); -extern void unlock_page(struct page *page); +extern void __unlock_page(struct page *page); static inline int trylock_page(struct page *page) { @@ -448,6 +448,20 @@ static inline int lock_page_killable(struct page *page) return 0; } +static inline void unlock_page(struct page *page) +{ + page = compound_head(page); + VM_BUG_ON_PAGE(!PageLocked(page), page); + clear_bit_unlock(PG_locked, &page->flags); + /* + * Since PG_locked and PG_contended are in the same word, Program-Order + * ensures the load of PG_contended must not observe a value earlier + * than our clear_bit() store. + */ + if (PageContended(page)) + __unlock_page(page); +} + /* * lock_page_or_retry - Lock the page, unless this would block and the * caller indicated that it can handle a retry. 
@@ -472,11 +486,11 @@ extern int wait_on_page_bit_killable(struct page *page, int bit_nr); extern int wait_on_page_bit_killable_timeout(struct page *page, int bit_nr, unsigned long timeout); +extern int wait_on_page_lock(struct page *page, int mode); + static inline int wait_on_page_locked_killable(struct page *page) { - if (!PageLocked(page)) - return 0; - return wait_on_page_bit_killable(compound_head(page), PG_locked); + return wait_on_page_lock(page, TASK_KILLABLE); } extern wait_queue_head_t *page_waitqueue(struct page *page); @@ -494,8 +508,7 @@ static inline void wake_up_page(struct page *page, int bit) */ static inline void wait_on_page_locked(struct page *page) { - if (PageLocked(page)) - wait_on_page_bit(compound_head(page), PG_locked); + wait_on_page_lock(page, TASK_UNINTERRUPTIBLE); } /* diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h index 5a81ab4..18b8398 100644 --- a/include/trace/events/mmflags.h +++ b/include/trace/events/mmflags.h @@ -81,6 +81,7 @@ #define __def_pageflag_names \ {1UL << PG_locked, "locked" }, \ + {1UL << PG_contended, "contended" }, \ {1UL << PG_error, "error" }, \ {1UL << PG_referenced, "referenced" }, \ {1UL << PG_uptodate, "uptodate" }, \ diff --git a/mm/filemap.c b/mm/filemap.c index 8a287df..734082a 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -847,15 +847,30 @@ EXPORT_SYMBOL_GPL(add_page_wait_queue); * The mb is necessary to enforce ordering between the clear_bit and the read * of the waitqueue (to avoid SMP races with a parallel wait_on_page_locked()). 
*/ -void unlock_page(struct page *page) +void __unlock_page(struct page *page) { - page = compound_head(page); - VM_BUG_ON_PAGE(!PageLocked(page), page); - clear_bit_unlock(PG_locked, &page->flags); - smp_mb__after_atomic(); - wake_up_page(page, PG_locked); + wait_queue_head_t *wq = page_waitqueue(page); + unsigned long flags; + + spin_lock_irqsave(&wq->lock, flags); + if (waitqueue_active(wq)) { + struct wait_bit_key key = + __WAIT_BIT_KEY_INITIALIZER(&page->flags, PG_locked); + + __wake_up_locked_key(wq, TASK_NORMAL, &key); + } else { + /* + * We need to do ClearPageContended() under wq->lock such that + * we serialize against prepare_to_wait() adding waiters and + * setting task_struct::state. + * + * See lock_page_wait(). + */ + ClearPageContended(page); + } + spin_unlock_irqrestore(&wq->lock, flags); } -EXPORT_SYMBOL(unlock_page); +EXPORT_SYMBOL(__unlock_page); /** * end_page_writeback - end writeback against a page @@ -908,6 +923,55 @@ void page_endio(struct page *page, bool is_write, int err) } EXPORT_SYMBOL_GPL(page_endio); +static int lock_page_wait(struct wait_bit_key *word, int mode) +{ + struct page *page = container_of(word->flags, struct page, flags); + + /* + * We cannot go sleep without having PG_contended set. This would mean + * nobody would issue a wakeup and we'd be stuck. + */ + if (!PageContended(page)) { + + /* + * There are two orderings of importance: + * + * 1) + * + * [unlock] [wait] + * + * clear PG_locked set PG_contended + * test PG_contended test (and-set) PG_locked + * + * Since these are on the same word, and the clear/set + * operation are atomic, they are ordered against one another. + * Program-Order further constraints a CPU from speculating the + * later load to not be earlier than the RmW. So this doesn't + * need an explicit barrier. Also see unlock_page(). 
+ * + * 2) + * + * [unlock] [wait] + * + * LOCK wq->lock LOCK wq->lock + * __wake_up_locked || list-add + * clear PG_contended set_current_state() + * UNLOCK wq->lock UNLOCK wq->lock + * set PG_contended + * + * Since we're added to the waitqueue, we cannot get + * PG_contended cleared without also getting TASK_RUNNING set, + * which will then void the schedule() call and we'll loop. + * Here wq->lock is sufficient ordering. See __unlock_page(). + */ + SetPageContended(page); + + return 0; + } + + return bit_wait_io(word, mode); +} + /** * __lock_page - get a lock on the page, assuming we need to sleep to get it * @page: the page to lock @@ -917,7 +981,7 @@ void __lock_page(struct page *page) struct page *page_head = compound_head(page); DEFINE_WAIT_BIT(wait, &page_head->flags, PG_locked); - __wait_on_bit_lock(page_waitqueue(page_head), &wait, bit_wait_io, + __wait_on_bit_lock(page_waitqueue(page_head), &wait, lock_page_wait, TASK_UNINTERRUPTIBLE); } EXPORT_SYMBOL(__lock_page); @@ -928,10 +992,22 @@ int __lock_page_killable(struct page *page) DEFINE_WAIT_BIT(wait, &page_head->flags, PG_locked); return __wait_on_bit_lock(page_waitqueue(page_head), &wait, - bit_wait_io, TASK_KILLABLE); + lock_page_wait, TASK_KILLABLE); } EXPORT_SYMBOL_GPL(__lock_page_killable); +int wait_on_page_lock(struct page *page, int mode) +{ + struct page __always_unused *__page = (page = compound_head(page)); + DEFINE_WAIT_BIT(wait, &page->flags, PG_locked); + + if (!PageLocked(page)) + return 0; + + return __wait_on_bit(page_waitqueue(page), &wait, lock_page_wait, mode); +} +EXPORT_SYMBOL(wait_on_page_lock); + /* * Return values: * 1 - page is locked; mmap_sem is still held. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 42+ messages in thread
* Re: page_waitqueue() considered harmful
  2016-09-29 13:08 ` Peter Zijlstra
@ 2016-10-03 10:47 ` Mel Gorman
  0 siblings, 0 replies; 42+ messages in thread
From: Mel Gorman @ 2016-10-03 10:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Andrew Morton, Kirill A. Shutemov, Johannes Weiner,
	Jan Kara, Nicholas Piggin, Rik van Riel, linux-mm

On Thu, Sep 29, 2016 at 03:08:27PM +0200, Peter Zijlstra wrote:
> > is not racy (the add_wait_queue() will now already guarantee that
> > nobody else clears the bit).
> >
> > Hmm?
>
> Yes. I got my brain in a complete twist, but you're right, that is
> indeed required.
>
> Here's a new version with hopefully clearer comments.
>
> Same caveat about 32bit, naming etc..
>

I was able to run this with basic workloads over the weekend on small UMA machines. Both machines behaved similarly so I'm only reporting one from a single-socket Skylake machine. NUMA machines rarely show anything much more interesting for these types of workloads but as always, the full impact is machine and workload dependent. Generally, I expect this type of patch to have marginal but detectable impact.

This is a workload doing parallel dd of files large enough to trigger reclaim, which locks/unlocks pages:

paralleldd
                       4.8.0-rc8            4.8.0-rc8
                         vanilla       waitqueue-v1r2
Amean    Elapsd-1   215.05 (  0.00%)  214.53 (  0.24%)
Amean    Elapsd-3   214.72 (  0.00%)  214.42 (  0.14%)
Amean    Elapsd-5   215.29 (  0.00%)  214.88 (  0.19%)
Amean    Elapsd-7   215.75 (  0.00%)  214.79 (  0.44%)
Amean    Elapsd-8   214.96 (  0.00%)  215.21 ( -0.12%)

That's basically within the noise. CPU usage overall looks like:

           4.8.0-rc8      4.8.0-rc8
             vanilla waitqueue-v1r2
User         3409.66        3421.72
System      18298.66       18251.99
Elapsed      7178.82        7181.14

Marginal decrease in system CPU usage. Profiles showed the vanilla kernel spending less than 0.1% on unlock_page but it's eliminated by the patch.

These are some microbenchmarks from the vm-scalability benchmark.
It's similar to dd in that it triggers reclaim from a single thread:

vmscale
                                    4.8.0-rc8          4.8.0-rc8
                                      vanilla     waitqueue-v1r2
Ops lru-file-mmap-read-elapsed   19.50 (  0.00%)   19.43 (  0.36%)
Ops lru-file-readonce-elapsed    12.44 (  0.00%)   12.29 (  1.21%)
Ops lru-file-readtwice-elapsed   22.27 (  0.00%)   22.19 (  0.36%)
Ops lru-memcg-elapsed            12.18 (  0.00%)   12.00 (  1.48%)

           4.8.0-rc8      4.8.0-rc8
             vanilla waitqueue-v1r2
User           50.54          50.88
System        398.72         388.81
Elapsed        69.48          68.99

Again, differences are marginal but detectable. I accidentally did not collect profile data but I have no reason to believe it's significantly different to dd.

This is "gitsource" from mmtests: a checkout of the git source tree and a run of "make test", which is where Linus first noticed the problem. The metric here is time-based; I don't actually check the results of the regression suite.

gitsource
                           4.8.0-rc8            4.8.0-rc8
                             vanilla       waitqueue-v1r2
User     min       192.28 (  0.00%)     192.49 ( -0.11%)
User     mean      193.55 (  0.00%)     194.88 ( -0.69%)
User     stddev      1.52 (  0.00%)       2.39 (-57.58%)
User     coeffvar    0.79 (  0.00%)       1.23 (-56.51%)
User     max       196.34 (  0.00%)     199.06 ( -1.39%)
System   min       122.70 (  0.00%)     118.69 (  3.27%)
System   mean      123.87 (  0.00%)     120.68 (  2.57%)
System   stddev      0.84 (  0.00%)       1.65 (-97.67%)
System   coeffvar    0.67 (  0.00%)       1.37 (-102.89%)
System   max       124.95 (  0.00%)     123.14 (  1.45%)
Elapsed  min       718.09 (  0.00%)     711.48 (  0.92%)
Elapsed  mean      724.23 (  0.00%)     716.52 (  1.07%)
Elapsed  stddev      4.20 (  0.00%)       4.84 (-15.42%)
Elapsed  coeffvar    0.58 (  0.00%)       0.68 (-16.66%)
Elapsed  max       730.51 (  0.00%)     724.45 (  0.83%)

           4.8.0-rc8      4.8.0-rc8
             vanilla waitqueue-v1r2
User         2730.60        2808.48
System       2184.85        2108.68
Elapsed      9938.01        9929.56

Overall, it's showing a drop in system CPU usage as expected. The detailed results show a drop of 2.57% in system CPU usage running the benchmark itself and 3.48% overall, which measures everything and not just "make test". The drop in elapsed time is marginal but measurable.
It may raise an eyebrow that the overall elapsed time doesn't match the detailed results. The detailed results report 5 iterations of "make test" without profiling enabled, which takes about an hour. The way I configured it, the profiled run happened immediately after it; it's much slower and also has to compile git itself, which takes a few minutes.

This is the top lock/unlock activity in the vanilla kernel:

  0.80%  git            [kernel.vmlinux]  [k] unlock_page
  0.28%  sh             [kernel.vmlinux]  [k] unlock_page
  0.20%  git-rebase     [kernel.vmlinux]  [k] unlock_page
  0.13%  git            [kernel.vmlinux]  [k] lock_page_memcg
  0.10%  git            [kernel.vmlinux]  [k] unlock_page_memcg
  0.07%  git-submodule  [kernel.vmlinux]  [k] unlock_page
  0.04%  sh             [kernel.vmlinux]  [k] lock_page_memcg
  0.03%  git-rebase     [kernel.vmlinux]  [k] lock_page_memcg
  0.03%  sh             [kernel.vmlinux]  [k] unlock_page_memcg
  0.03%  sed            [kernel.vmlinux]  [k] unlock_page
  0.03%  perf           [kernel.vmlinux]  [k] unlock_page
  0.02%  git-rebase     [kernel.vmlinux]  [k] unlock_page_memcg
  0.02%  rm             [kernel.vmlinux]  [k] unlock_page
  0.02%  git-stash      [kernel.vmlinux]  [k] unlock_page
  0.02%  git-bisect     [kernel.vmlinux]  [k] unlock_page
  0.02%  diff           [kernel.vmlinux]  [k] unlock_page
  0.02%  cat            [kernel.vmlinux]  [k] unlock_page
  0.02%  wc             [kernel.vmlinux]  [k] unlock_page
  0.01%  mv             [kernel.vmlinux]  [k] unlock_page
  0.01%  git-submodule  [kernel.vmlinux]  [k] lock_page_memcg

This is with the patch applied:

  0.49%  git            [kernel.vmlinux]  [k] unlock_page
  0.14%  sh             [kernel.vmlinux]  [k] unlock_page
  0.13%  git            [kernel.vmlinux]  [k] lock_page_memcg
  0.11%  git-rebase     [kernel.vmlinux]  [k] unlock_page
  0.10%  git            [kernel.vmlinux]  [k] unlock_page_memcg
  0.04%  sh             [kernel.vmlinux]  [k] lock_page_memcg
  0.04%  git-submodule  [kernel.vmlinux]  [k] unlock_page
  0.03%  sh             [kernel.vmlinux]  [k] unlock_page_memcg
  0.03%  git-rebase     [kernel.vmlinux]  [k] lock_page_memcg
  0.02%  git-rebase     [kernel.vmlinux]  [k] unlock_page_memcg
  0.02%  sed            [kernel.vmlinux]  [k] unlock_page
  0.01%  rm             [kernel.vmlinux]  [k] unlock_page
  0.01%  git-stash      [kernel.vmlinux]  [k] unlock_page
  0.01%  git-submodule  [kernel.vmlinux]  [k] lock_page_memcg
  0.01%  git-bisect     [kernel.vmlinux]  [k] unlock_page
  0.01%  diff           [kernel.vmlinux]  [k] unlock_page
  0.01%  cat            [kernel.vmlinux]  [k] unlock_page
  0.01%  wc             [kernel.vmlinux]  [k] unlock_page
  0.01%  git-submodule  [kernel.vmlinux]  [k] unlock_page_memcg
  0.01%  mv             [kernel.vmlinux]  [k] unlock_page

The drop in time spent by git in unlock_page is noticeable. I did not drill down into the annotated profile but this roughly matches what I measured before when avoiding page_waitqueue lookups. The full profile is not exactly great but I didn't see anything in there I haven't seen before. Top entries with the patch applied look like this:

  7.44%  swapper  [kernel.vmlinux]  [k] intel_idle
  1.25%  git      [kernel.vmlinux]  [k] filemap_map_pages
  1.08%  git      [kernel.vmlinux]  [k] native_irq_return_iret
  0.79%  git      [kernel.vmlinux]  [k] unmap_page_range
  0.56%  git      [kernel.vmlinux]  [k] release_pages
  0.51%  git      [kernel.vmlinux]  [k] handle_mm_fault
  0.49%  git      [kernel.vmlinux]  [k] unlock_page
  0.46%  git      [kernel.vmlinux]  [k] page_remove_rmap
  0.46%  git      [kernel.vmlinux]  [k] _raw_spin_lock
  0.42%  git      [kernel.vmlinux]  [k] clear_page_c_e

Lot of map/unmap activity like you'd expect and release_pages being a pig as usual.

Overall, this patch shows similar behaviour to my own patch from 2014. There is a definite benefit but it's marginal. The big difference is that this patch is a lot simpler than the 2014 version and may meet less resistance as a result.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: page_waitqueue() considered harmful 2016-09-27 8:31 ` Mel Gorman 2016-09-27 14:34 ` Peter Zijlstra @ 2016-09-27 14:53 ` Nicholas Piggin 2016-09-27 15:17 ` Nicholas Piggin ` (2 more replies) 1 sibling, 3 replies; 42+ messages in thread From: Nicholas Piggin @ 2016-09-27 14:53 UTC (permalink / raw) To: Mel Gorman Cc: Linus Torvalds, Andrew Morton, Kirill A. Shutemov, Johannes Weiner, Jan Kara, Peter Zijlstra, Rik van Riel, linux-mm On Tue, 27 Sep 2016 09:31:04 +0100 Mel Gorman <mgorman@techsingularity.net> wrote: > Hi Linus, > > On Mon, Sep 26, 2016 at 01:58:00PM -0700, Linus Torvalds wrote: > > So I've been doing some profiling of the git "make test" load, which > > is interesting because it's just a lot of small scripts, and it shows > > our fork/execve/exit costs. > > > > Some of the top offenders are pretty understandable: we have long had > > unmap_page_range() show up at the top on these loads, because the > > zap_pte_range() function ends up touching each page as it unmaps > > things (it does it to check whether it's an anonymous page, but then > > also for the page map count update etc), and that's a disaster from a > > cache standpoint. That single function is something between 3-4% of > > CPU time, and the one instruction that accesses "struct page" the > > first time is a large portion of that. Yes, a single instruction is > > blamed for about 1% of all CPU time on a fork/exec/exit workload. > > > > It was found at one point that the fault-around made these costs worse as > there were simply more pages to tear down. However, this only applied to > fork/exit microbenchmarks. Matt Fleming prototyped an unreleased patch > that tried to be clever about this but the cost was never worthwhile. A > plain revert helped a microbenchmark but hurt workloads like the git > testsuite which was shell intensive. > > It got filed under "we're not fixing a fork/exit microbenchmark at the > expense of "real" workloads like git checkout and git testsuite". 
> > > <SNIP> > > > > #5 and #6 on my profile are user space (_int_malloc in glibc, and > > do_lookup_x in the loader - I think user space should probably start > > thinking more about doing static libraries for the really core basic > > things, but whatever. Not a kernel issue. > > > > Recent problems have been fixed with _int_malloc in glibc, particularly as it > applies to threads but no fix springs to mind that might impact "make test". > > > #7 is in the kernel again. And that one struck me as really odd. It's > > "unlock_page()", while #9 is __wake_up_bit(). WTF? There's no IO in > > this load, it's all cached, why do we use 3% of the time (1.7% and > > 1.4% respectively) on unlocking a page. And why can't I see the > > locking part? > > > > It turns out that I *can* see the locking part, but it's pretty cheap. > > It's inside of filemap_map_pages(), which does a trylock, and it shows > > up as about 1/6th of the cost of that function. Still, it's much less > > than the unlocking side. Why is unlocking so expensive? > > > > Yeah, unlocking is expensive because of the nasty __wake_up_bit() > > code. In fact, even inside "unlock_page()" itself, most of the costs > > aren't even the atomic bit clearing (like you'd expect), it's the > > inlined part of wake_up_bit(). Which does some really nasty crud. > > > > Why is the page_waitqueue() handling so expensive? Let me count the ways: > > > > page_waitqueue() has been a hazard for years. I think the last attempt to > fix it was back in 2014 http://www.spinics.net/lists/linux-mm/msg73207.html > > The patch is heavily derived from work by Nick Piggin who noticed the years > before that. I think that was the last version I posted and the changelog > includes profile data. I don't have an exact reference why it was rejected > but a consistent piece of feedback was that it was very complex for the > level of impact it had. Huh, I was just wondering about this again the other day. 
Powerpc has some interesting issues with atomic ops and barriers (not to mention random cache misses that hurt everybody). It actually wasn't for big Altix machines (at least not when I wrote it), but it came from some effort to optimize page reclaim performance on an opteron with a lot (back then) of cores per node. And it's not only for scalability, it's a single threaded performance optimisation as much as anything. By the way I think that patch linked is taking the wrong approach. Better to put all the complexity of maintaining the waiters bit into the sleep/wake functions. The fastpath simply tests the bit in no less racy a manner than the unlocked waitqueue_active() test. Really incomplete patch attached for reference. The part where we hack the wait code into maintaining the extra bit for us is pretty mechanical and boring so long as it's under the waitqueue lock. The more interesting is the ability to avoid the barrier between fastpath clearing a bit and testing for waiters. unlock(): lock() (slowpath): clear_bit(PG_locked) set_bit(PG_waiter) test_bit(PG_waiter) test_bit(PG_locked) If this was memory ops to different words, it would require smp_mb each side.. Being the same word, can we avoid them? ISTR Linus you were worried about stores being forwarded to loads before it is visible to the other CPU. I think that should be okay because the stores will be ordered, and the load can't move earlier than the store on the same CPU. Maybe I completely misremember it. 
Subject: [PATCH] blah

---
 include/linux/pagemap.h | 10 +++---
 mm/filemap.c            | 92 +++++++++++++++++++++++++++++++++++++++++++------
 2 files changed, 87 insertions(+), 15 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 66a1260..a536df9 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -479,10 +479,11 @@ static inline int wait_on_page_locked_killable(struct page *page)
 	return wait_on_page_bit_killable(compound_head(page), PG_locked);
 }

-extern wait_queue_head_t *page_waitqueue(struct page *page);
 static inline void wake_up_page(struct page *page, int bit)
 {
-	__wake_up_bit(page_waitqueue(page), &page->flags, bit);
+	if (!PageWaiters(page))
+		return;
+	wake_up_page_bit(page, bit);
 }

 /*
@@ -494,8 +495,9 @@ static inline void wake_up_page(struct page *page, int bit)
  */
static inline void wait_on_page_locked(struct page *page)
 {
-	if (PageLocked(page))
-		wait_on_page_bit(compound_head(page), PG_locked);
+	if (!PageLocked(page))
+		return 0;
+	wait_on_page_bit(compound_head(page), PG_locked);
 }

 /*
diff --git a/mm/filemap.c b/mm/filemap.c
index 8a287df..09bca8a 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -775,24 +775,98 @@ EXPORT_SYMBOL(__page_cache_alloc);
  * at a cost of "thundering herd" phenomena during rare hash
  * collisions.
  */
-wait_queue_head_t *page_waitqueue(struct page *page)
+static wait_queue_head_t *page_waitqueue(struct page *page)
 {
 	const struct zone *zone = page_zone(page);

 	return &zone->wait_table[hash_ptr(page, zone->wait_table_bits)];
 }
-EXPORT_SYMBOL(page_waitqueue);
+
+struct wait_page_key {
+	struct page *page;
+	int bit_nr;
+	int page_match;
+};
+
+struct wait_page_queue {
+	struct wait_page_key key;
+	wait_queue_t wait;
+};
+
+static int wake_page_function(wait_queue_t *wait, unsigned mode, int sync, void *arg)
+{
+	struct wait_page_queue *wait = container_of(wait, struct wait_page_queue, wait);
+	struct wait_page_key *key = arg;
+	int ret;
+
+	if (wait->key.page != key->page)
+		return 0;
+	key->page_match = 1;
+	if (wait->key.bit_nr != key->bit_nr ||
+	    test_bit(key->bit_nr, &key->page->flags))
+		return 0;
+
+	ret = try_to_wake_up(wait->wait.private, mode, sync);
+	if (ret)
+		list_del_init(&wait->task_list);
+	return ret;
+}

 void wait_on_page_bit(struct page *page, int bit_nr)
 {
-	DEFINE_WAIT_BIT(wait, &page->flags, bit_nr);
+	wait_queue_head_t *wq = page_waitqueue(page);
+	struct wait_page_queue wait;

-	if (test_bit(bit_nr, &page->flags))
-		__wait_on_bit(page_waitqueue(page), &wait, bit_wait_io,
-							TASK_UNINTERRUPTIBLE);
+	init_wait(&wait.wait);
+	wait.wait.func = wake_page_function;
+	wait.key.page = page;
+	wait.key.bit_nr = bit_nr;
+	/* wait.key.page_match unused */
+
+	wait.flags &= ~WQ_FLAG_EXCLUSIVE;
+
+again:
+	spin_lock_irq(&wq->lock);
+	if (unlikely(!test_bit(bit_nr, &page->flags))) {
+		spin_unlock_irq(&wq->lock);
+		return;
+	}
+
+	if (list_empty(&wait->task_list)) {
+		__add_wait_queue(wq, &wait);
+		if (!PageWaiters(page))
+			SetPageWaiters(page);
+	}
+	set_current_state(TASK_UNINTERRUPTIBLE);
+	spin_unlock_irq(&wq->lock);
+
+	io_schedule();
+
+	if (!test_bit(bit_nr, &page->flags))
+		goto again;
+
+	finish_wait(wq, &wait);
 }
 EXPORT_SYMBOL(wait_on_page_bit);

+void wake_up_page_bit(struct page *page, int bit_nr)
+{
+	wait_queue_head_t *wq;
+	struct wait_page_key key;
+	unsigned long flags;
+
+	key.page = page;
+	key.bit_nr = bit_nr;
+	key.page_match = 0;
+
+	wq = page_waitqueue(page);
+	spin_lock_irqsave(&wq->lock, flags);
+	__wake_up_locked_key(wq, TASK_NORMAL, &key);
+	if (!waitqueue_active(wq) || !key.page_match)
+		ClearPageWaiters(page);
+	spin_unlock_irqrestore(&wq->lock, flags);
+}
+
 int wait_on_page_bit_killable(struct page *page, int bit_nr)
 {
 	DEFINE_WAIT_BIT(wait, &page->flags, bit_nr);
@@ -852,7 +926,6 @@ void unlock_page(struct page *page)
 	page = compound_head(page);
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 	clear_bit_unlock(PG_locked, &page->flags);
-	smp_mb__after_atomic();
 	wake_up_page(page, PG_locked);
 }
 EXPORT_SYMBOL(unlock_page);
@@ -875,10 +948,7 @@ void end_page_writeback(struct page *page)
 		rotate_reclaimable_page(page);
 	}

-	if (!test_clear_page_writeback(page))
-		BUG();
-
-	smp_mb__after_atomic();
+	clear_bit_unlock(PG_writeback, &page->flags);
 	wake_up_page(page, PG_writeback);
 }
 EXPORT_SYMBOL(end_page_writeback);
-- 
2.9.3

^ permalink raw reply related [flat|nested] 42+ messages in thread
* Re: page_waitqueue() considered harmful 2016-09-27 14:53 ` Nicholas Piggin @ 2016-09-27 15:17 ` Nicholas Piggin 2016-09-27 16:52 ` Peter Zijlstra 2016-09-28 7:40 ` Mel Gorman 2 siblings, 0 replies; 42+ messages in thread From: Nicholas Piggin @ 2016-09-27 15:17 UTC (permalink / raw) To: Mel Gorman Cc: Linus Torvalds, Andrew Morton, Kirill A. Shutemov, Johannes Weiner, Jan Kara, Peter Zijlstra, Rik van Riel, linux-mm On Wed, 28 Sep 2016 00:53:18 +1000 Nicholas Piggin <npiggin@gmail.com> wrote: > On Tue, 27 Sep 2016 09:31:04 +0100 > Mel Gorman <mgorman@techsingularity.net> wrote: > > > Hi Linus, > > > > On Mon, Sep 26, 2016 at 01:58:00PM -0700, Linus Torvalds wrote: > > > So I've been doing some profiling of the git "make test" load, which > > > is interesting because it's just a lot of small scripts, and it shows > > > our fork/execve/exit costs. > > > > > > Some of the top offenders are pretty understandable: we have long had > > > unmap_page_range() show up at the top on these loads, because the > > > zap_pte_range() function ends up touching each page as it unmaps > > > things (it does it to check whether it's an anonymous page, but then > > > also for the page map count update etc), and that's a disaster from a > > > cache standpoint. That single function is something between 3-4% of > > > CPU time, and the one instruction that accesses "struct page" the > > > first time is a large portion of that. Yes, a single instruction is > > > blamed for about 1% of all CPU time on a fork/exec/exit workload. > > > > > > > It was found at one point that the fault-around made these costs worse as > > there were simply more pages to tear down. However, this only applied to > > fork/exit microbenchmarks. Matt Fleming prototyped an unreleased patch > > that tried to be clever about this but the cost was never worthwhile. A > > plain revert helped a microbenchmark but hurt workloads like the git > > testsuite which was shell intensive. 
> > > > It got filed under "we're not fixing a fork/exit microbenchmark at the > > expense of "real" workloads like git checkout and git testsuite". > > > > > <SNIP> > > > > > > #5 and #6 on my profile are user space (_int_malloc in glibc, and > > > do_lookup_x in the loader - I think user space should probably start > > > thinking more about doing static libraries for the really core basic > > > things, but whatever. Not a kernel issue. > > > > > > > Recent problems have been fixed with _int_malloc in glibc, particularly as it > > applies to threads but no fix springs to mind that might impact "make test". > > > > > #7 is in the kernel again. And that one struck me as really odd. It's > > > "unlock_page()", while #9 is __wake_up_bit(). WTF? There's no IO in > > > this load, it's all cached, why do we use 3% of the time (1.7% and > > > 1.4% respectively) on unlocking a page. And why can't I see the > > > locking part? > > > > > > It turns out that I *can* see the locking part, but it's pretty cheap. > > > It's inside of filemap_map_pages(), which does a trylock, and it shows > > > up as about 1/6th of the cost of that function. Still, it's much less > > > than the unlocking side. Why is unlocking so expensive? > > > > > > Yeah, unlocking is expensive because of the nasty __wake_up_bit() > > > code. In fact, even inside "unlock_page()" itself, most of the costs > > > aren't even the atomic bit clearing (like you'd expect), it's the > > > inlined part of wake_up_bit(). Which does some really nasty crud. > > > > > > Why is the page_waitqueue() handling so expensive? Let me count the ways: > > > > > > > page_waitqueue() has been a hazard for years. I think the last attempt to > > fix it was back in 2014 http://www.spinics.net/lists/linux-mm/msg73207.html > > > > The patch is heavily derived from work by Nick Piggin who noticed the years > > before that. I think that was the last version I posted and the changelog > > includes profile data. 
I don't have an exact reference why it was rejected > > but a consistent piece of feedback was that it was very complex for the > > level of impact it had. > > Huh, I was just wondering about this again the other day. Powerpc has > some interesting issues with atomic ops and barriers (not to mention > random cache misses that hurt everybody). > > It actually wasn't for big Altix machines (at least not when I wrote it), > but it came from some effort to optimize page reclaim performance on an > opteron with a lot (back then) of cores per node. > > And it's not only for scalability, it's a single threaded performance > optimisation as much as anything. > > By the way I think that patch linked is taking the wrong approach. Better > to put all the complexity of maintaining the waiters bit into the sleep/wake > functions. The fastpath simply tests the bit in no less racy a manner than > the unlocked waitqueue_active() test. Really incomplete patch attached for > reference. > > The part where we hack the wait code into maintaining the extra bit for us > is pretty mechanical and boring so long as it's under the waitqueue lock. > > The more interesting is the ability to avoid the barrier between fastpath > clearing a bit and testing for waiters. > > unlock(): lock() (slowpath): > clear_bit(PG_locked) set_bit(PG_waiter) > test_bit(PG_waiter) test_bit(PG_locked) > > If this was memory ops to different words, it would require smp_mb each > side.. Being the same word, can we avoid them? ISTR Linus you were worried > about stores being forwarded to loads before it is visible to the other CPU. > I think that should be okay because the stores will be ordered, and the load > can't move earlier than the store on the same CPU. Maybe I completely > misremember it. 
> > > Subject: [PATCH] blah > > --- > include/linux/pagemap.h | 10 +++--- > mm/filemap.c | 92 +++++++++++++++++++++++++++++++++++++++++++------ > 2 files changed, 87 insertions(+), 15 deletions(-) > > diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h > index 66a1260..a536df9 100644 > --- a/include/linux/pagemap.h > +++ b/include/linux/pagemap.h > @@ -479,10 +479,11 @@ static inline int wait_on_page_locked_killable(struct page *page) > return wait_on_page_bit_killable(compound_head(page), PG_locked); > } > > -extern wait_queue_head_t *page_waitqueue(struct page *page); > static inline void wake_up_page(struct page *page, int bit) > { > - __wake_up_bit(page_waitqueue(page), &page->flags, bit); > + if (!PageWaiters(page)) > + return; > + wake_up_page_bit(page, bit); > } > > /* > @@ -494,8 +495,9 @@ static inline void wake_up_page(struct page *page, int bit) > */ > static inline void wait_on_page_locked(struct page *page) > { > - if (PageLocked(page)) > - wait_on_page_bit(compound_head(page), PG_locked); > + if (!PageLocked(page)) > + return 0; > + wait_on_page_bit(compound_head(page), PG_locked); > } > > /* > diff --git a/mm/filemap.c b/mm/filemap.c > index 8a287df..09bca8a 100644 > --- a/mm/filemap.c > +++ b/mm/filemap.c > @@ -775,24 +775,98 @@ EXPORT_SYMBOL(__page_cache_alloc); > * at a cost of "thundering herd" phenomena during rare hash > * collisions. 
> */ > -wait_queue_head_t *page_waitqueue(struct page *page) > +static wait_queue_head_t *page_waitqueue(struct page *page) > { > const struct zone *zone = page_zone(page); > > return &zone->wait_table[hash_ptr(page, zone->wait_table_bits)]; > } > -EXPORT_SYMBOL(page_waitqueue); > + > +struct wait_page_key { > + struct page *page; > + int bit_nr; > + int page_match; > +}; > + > +struct wait_page_queue { > + struct wait_page_key key; > + wait_queue_t wait; > +}; > + > +static int wake_page_function(wait_queue_t *wait, unsigned mode, int sync, void *arg) > +{ > + struct wait_page_queue *wait = container_of(wait, struct wait_page_queue, wait); > + struct wait_page_key *key = arg; > + int ret; > + > + if (wait->key.page != key->page) > + return 0; > + key->page_match = 1; > + if (wait->key.bit_nr != key->bit_nr || > + test_bit(key->bit_nr, &key->page->flags)) > + return 0; > + > + ret = try_to_wake_up(wait->wait.private, mode, sync); > + if (ret) > + list_del_init(&wait->task_list); > + return ret; > +} > > void wait_on_page_bit(struct page *page, int bit_nr) > { > - DEFINE_WAIT_BIT(wait, &page->flags, bit_nr); > + wait_queue_head_t *wq = page_waitqueue(page); > + struct wait_page_queue wait; > > - if (test_bit(bit_nr, &page->flags)) > - __wait_on_bit(page_waitqueue(page), &wait, bit_wait_io, > - TASK_UNINTERRUPTIBLE); > + init_wait(&wait.wait); > + wait.wait.func = wake_page_function; > + wait.key.page = page; > + wait.key.bit_nr = bit_nr; > + /* wait.key.page_match unused */ > + > + wait.flags &= ~WQ_FLAG_EXCLUSIVE; > + > +again: > + spin_lock_irq(&wq->lock); > + if (unlikely(!test_bit(bit_nr, &page->flags))) { > + spin_unlock_irq(&wq->lock); > + return; > + } > + > + if (list_empty(&wait->task_list)) { > + __add_wait_queue(wq, &wait); > + if (!PageWaiters(page)) > + SetPageWaiters(page); > + } Ugh, I even wrote the correct ordering in the email, but did it the wrong way here. Anyway, you get the idea. 
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: page_waitqueue() considered harmful 2016-09-27 14:53 ` Nicholas Piggin 2016-09-27 15:17 ` Nicholas Piggin @ 2016-09-27 16:52 ` Peter Zijlstra 2016-09-27 17:06 ` Nicholas Piggin 2016-09-28 7:40 ` Mel Gorman 2 siblings, 1 reply; 42+ messages in thread From: Peter Zijlstra @ 2016-09-27 16:52 UTC (permalink / raw) To: Nicholas Piggin Cc: Mel Gorman, Linus Torvalds, Andrew Morton, Kirill A. Shutemov, Johannes Weiner, Jan Kara, Rik van Riel, linux-mm, Will Deacon On Wed, Sep 28, 2016 at 12:53:18AM +1000, Nicholas Piggin wrote: > The more interesting is the ability to avoid the barrier between fastpath > clearing a bit and testing for waiters. > > unlock(): lock() (slowpath): > clear_bit(PG_locked) set_bit(PG_waiter) > test_bit(PG_waiter) test_bit(PG_locked) > > If this was memory ops to different words, it would require smp_mb each > side.. Being the same word, can we avoid them? Ah, that is the reason I put that smp_mb__after_atomic() there. You have a cute point on them being to the same word though. Need to think about that. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: page_waitqueue() considered harmful 2016-09-27 16:52 ` Peter Zijlstra @ 2016-09-27 17:06 ` Nicholas Piggin 2016-09-28 7:05 ` Peter Zijlstra 0 siblings, 1 reply; 42+ messages in thread From: Nicholas Piggin @ 2016-09-27 17:06 UTC (permalink / raw) To: Peter Zijlstra Cc: Mel Gorman, Linus Torvalds, Andrew Morton, Kirill A. Shutemov, Johannes Weiner, Jan Kara, Rik van Riel, linux-mm, Will Deacon On Tue, 27 Sep 2016 18:52:21 +0200 Peter Zijlstra <peterz@infradead.org> wrote: > On Wed, Sep 28, 2016 at 12:53:18AM +1000, Nicholas Piggin wrote: > > The more interesting is the ability to avoid the barrier between fastpath > > clearing a bit and testing for waiters. > > > > unlock(): lock() (slowpath): > > clear_bit(PG_locked) set_bit(PG_waiter) > > test_bit(PG_waiter) test_bit(PG_locked) > > > > If this was memory ops to different words, it would require smp_mb each > > side.. Being the same word, can we avoid them? > > Ah, that is the reason I put that smp_mb__after_atomic() there. You have > a cute point on them being to the same word though. Need to think about > that. This is all assuming the store accesses are ordered, which you should get if the stores to the different bits operate on the same address and size. That might not be the case for some architectures, but they might not require barriers for other reasons. That would call for an smp_mb variant that is used for bitops on different bits but same aligned long. Thanks, Nick -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: page_waitqueue() considered harmful 2016-09-27 17:06 ` Nicholas Piggin @ 2016-09-28 7:05 ` Peter Zijlstra 2016-09-28 11:05 ` Paul E. McKenney 2016-09-29 1:31 ` Nicholas Piggin 0 siblings, 2 replies; 42+ messages in thread From: Peter Zijlstra @ 2016-09-28 7:05 UTC (permalink / raw) To: Nicholas Piggin Cc: Mel Gorman, Linus Torvalds, Andrew Morton, Kirill A. Shutemov, Johannes Weiner, Jan Kara, Rik van Riel, linux-mm, Will Deacon, Paul McKenney, Alan Stern On Wed, Sep 28, 2016 at 03:06:21AM +1000, Nicholas Piggin wrote: > On Tue, 27 Sep 2016 18:52:21 +0200 > Peter Zijlstra <peterz@infradead.org> wrote: > > > On Wed, Sep 28, 2016 at 12:53:18AM +1000, Nicholas Piggin wrote: > > > The more interesting is the ability to avoid the barrier between fastpath > > > clearing a bit and testing for waiters. > > > > > > unlock(): lock() (slowpath): > > > clear_bit(PG_locked) set_bit(PG_waiter) > > > test_bit(PG_waiter) test_bit(PG_locked) > > > > > > If this was memory ops to different words, it would require smp_mb each > > > side.. Being the same word, can we avoid them? > > > > Ah, that is the reason I put that smp_mb__after_atomic() there. You have > > a cute point on them being to the same word though. Need to think about > > that. > > This is all assuming the store accesses are ordered, which you should get > if the stores to the different bits operate on the same address and size. > That might not be the case for some architectures, but they might not > require barriers for other reasons. That would call for an smp_mb variant > that is used for bitops on different bits but same aligned long. Since the {set,clear}_bit operations are atomic, they must be ordered against one another. The subsequent test_bit is a load, which, since it's to the same variable and a CPU must appear to preserve program order, must come after the RmW. So I think you're right and that we can forgo the memory barriers here. I even think this must be true on all architectures. 
Paul and Alan have a validation tool someplace, put them on Cc.
* Re: page_waitqueue() considered harmful
2016-09-28 7:05 ` Peter Zijlstra @ 2016-09-28 11:05 ` Paul E. McKenney 2016-09-28 11:16 ` Peter Zijlstra 2016-09-29 1:31 ` Nicholas Piggin
1 sibling, 1 reply; 42+ messages in thread
From: Paul E. McKenney @ 2016-09-28 11:05 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Nicholas Piggin, Mel Gorman, Linus Torvalds, Andrew Morton, Kirill A. Shutemov, Johannes Weiner, Jan Kara, Rik van Riel, linux-mm, Will Deacon, Alan Stern

On Wed, Sep 28, 2016 at 09:05:46AM +0200, Peter Zijlstra wrote:
> On Wed, Sep 28, 2016 at 03:06:21AM +1000, Nicholas Piggin wrote:
> > On Tue, 27 Sep 2016 18:52:21 +0200
> > Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > > On Wed, Sep 28, 2016 at 12:53:18AM +1000, Nicholas Piggin wrote:
> > > > The more interesting is the ability to avoid the barrier between fastpath
> > > > clearing a bit and testing for waiters.
> > > >
> > > > unlock():                 lock() (slowpath):
> > > > clear_bit(PG_locked)      set_bit(PG_waiter)
> > > > test_bit(PG_waiter)      test_bit(PG_locked)

The point being that at least one of the test_bit() calls must return true?

> > > > If this was memory ops to different words, it would require smp_mb each
> > > > side.. Being the same word, can we avoid them?
> > >
> > > Ah, that is the reason I put that smp_mb__after_atomic() there. You have
> > > a cute point on them being to the same word though. Need to think about
> > > that.
> >
> > This is all assuming the store accesses are ordered, which you should get
> > if the stores to the different bits operate on the same address and size.
> > That might not be the case for some architectures, but they might not
> > require barriers for other reasons. That would call for an smp_mb variant
> > that is used for bitops on different bits but same aligned long.

As far as I know, all architectures fully order aligned same-size machine-sized accesses to the same location even without barriers.
In the example above, the PG_locked and PG_waiter are different bits in the same location, correct? (Looks that way, but the above also looks a bit abbreviated.)

Not sure that Documentation/memory-barriers.txt is as clear as it should be on this one, though.

> Since the {set,clear}_bit operations are atomic, they must be ordered
> against one another. The subsequent test_bit is a load, which, since it's
> to the same variable, and a CPU must appear to preserve Program-Order,
> must come after the RmW.

But from Documentation/atomic_ops.txt:

------------------------------------------------------------------------
	void set_bit(unsigned long nr, volatile unsigned long *addr);
	void clear_bit(unsigned long nr, volatile unsigned long *addr);
	void change_bit(unsigned long nr, volatile unsigned long *addr);

These routines set, clear, and change, respectively, the bit number indicated by "nr" on the bit mask pointed to by "ADDR". They must execute atomically, yet there are no implicit memory barrier semantics required of these interfaces.
------------------------------------------------------------------------

So unless they operate on the same location or are accompanied by something like the smp_mb__after_atomic() called out above, there is no ordering.

> So I think you're right and that we can forgo the memory barriers here.
> I even think this must be true on all architectures.
>
> Paul and Alan have a validation tool someplace, put them on Cc.

It does not yet fully handle atomics (but maybe Alan is ahead of me here, in which case he won't be shy).
However, the point about strong ordering of same-sized aligned accesses to a machine-sized location can be made without atomics:

------------------------------------------------------------------------
C C-piggin-SB+samevar

{
	x=0;
}

P0(int *x)
{
	WRITE_ONCE(*x, 1);
	r1 = READ_ONCE(*x);
}

P1(int *x)
{
	WRITE_ONCE(*x, 2);
	r1 = READ_ONCE(*x);
}

exists (0:r1=0 /\ 1:r1=0)
------------------------------------------------------------------------

This gets us the following, as expected:

Test C-piggin-SB+samevar Allowed
States 3
0:r1=1; 1:r1=1;
0:r1=1; 1:r1=2;
0:r1=2; 1:r1=2;
No
Witnesses
Positive: 0 Negative: 4
Condition exists (0:r1=0 /\ 1:r1=0)
Observation C-piggin-SB+samevar Never 0 4
Hash=31466d003c10c07836583f5879824502

Thanx, Paul
* Re: page_waitqueue() considered harmful
2016-09-28 11:05 ` Paul E. McKenney @ 2016-09-28 11:16 ` Peter Zijlstra 2016-09-28 12:58 ` Paul E. McKenney
0 siblings, 1 reply; 42+ messages in thread
From: Peter Zijlstra @ 2016-09-28 11:16 UTC (permalink / raw)
To: Paul E. McKenney
Cc: Nicholas Piggin, Mel Gorman, Linus Torvalds, Andrew Morton, Kirill A. Shutemov, Johannes Weiner, Jan Kara, Rik van Riel, linux-mm, Will Deacon, Alan Stern

On Wed, Sep 28, 2016 at 04:05:30AM -0700, Paul E. McKenney wrote:
> On Wed, Sep 28, 2016 at 09:05:46AM +0200, Peter Zijlstra wrote:
> > On Wed, Sep 28, 2016 at 03:06:21AM +1000, Nicholas Piggin wrote:
> > > On Tue, 27 Sep 2016 18:52:21 +0200
> > > Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > > On Wed, Sep 28, 2016 at 12:53:18AM +1000, Nicholas Piggin wrote:
> > > > > The more interesting is the ability to avoid the barrier between fastpath
> > > > > clearing a bit and testing for waiters.
> > > > >
> > > > > unlock():                 lock() (slowpath):
> > > > > clear_bit(PG_locked)      set_bit(PG_waiter)
> > > > > test_bit(PG_waiter)      test_bit(PG_locked)
>
> The point being that at least one of the test_bit() calls must return
> true?

Yes, more or less. Either unlock() observes PG_waiters set, or lock() observes PG_locked unset. (As opposed to all our 'normal' examples, the initial state isn't all 0 and the stores aren't all 1 :-)

> As far as I know, all architectures fully order aligned same-size
> machine-sized accesses to the same location even without barriers.
> In the example above, the PG_locked and PG_waiter are different bits in
> the same location, correct? (Looks that way, but the above also looks
> a bit abbreviated.)

Correct, PG_* all live in the same word.

> So unless they operate on the same location or are accompanied by
> something like the smp_mb__after_atomic() called out above, there
> is no ordering.

Same word..

> > So I think you're right and that we can forgo the memory barriers here.
> > I even think this must be true on all architectures.
> >
> > Paul and Alan have a validation tool someplace, put them on Cc.
>
> It does not yet fully handle atomics (but maybe Alan is ahead of
> me here, in which case he won't be shy). However, the point about
> strong ordering of same-sized aligned accesses to a machine-sized
> location can be made without atomics:

Great. That's what I remember from reading that stuff.
* Re: page_waitqueue() considered harmful
2016-09-28 11:16 ` Peter Zijlstra @ 2016-09-28 12:58 ` Paul E. McKenney
0 siblings, 0 replies; 42+ messages in thread
From: Paul E. McKenney @ 2016-09-28 12:58 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Nicholas Piggin, Mel Gorman, Linus Torvalds, Andrew Morton, Kirill A. Shutemov, Johannes Weiner, Jan Kara, Rik van Riel, linux-mm, Will Deacon, Alan Stern

On Wed, Sep 28, 2016 at 01:16:45PM +0200, Peter Zijlstra wrote:
> On Wed, Sep 28, 2016 at 04:05:30AM -0700, Paul E. McKenney wrote:
> > On Wed, Sep 28, 2016 at 09:05:46AM +0200, Peter Zijlstra wrote:
> > > On Wed, Sep 28, 2016 at 03:06:21AM +1000, Nicholas Piggin wrote:
> > > > On Tue, 27 Sep 2016 18:52:21 +0200
> > > > Peter Zijlstra <peterz@infradead.org> wrote:
> > > >
> > > > > On Wed, Sep 28, 2016 at 12:53:18AM +1000, Nicholas Piggin wrote:
> > > > > > The more interesting is the ability to avoid the barrier between fastpath
> > > > > > clearing a bit and testing for waiters.
> > > > > >
> > > > > > unlock():                 lock() (slowpath):
> > > > > > clear_bit(PG_locked)      set_bit(PG_waiter)
> > > > > > test_bit(PG_waiter)      test_bit(PG_locked)
> >
> > The point being that at least one of the test_bit() calls must return
> > true?
>
> Yes, more or less. Either unlock() observes PG_waiters set, or lock()
> observes PG_locked unset. (As opposed to all our 'normal' examples, the
> initial state isn't all 0 and the stores aren't all 1 :-)

You lost me on unlock() doing any observation, but yes, I transliterated to standard form, unintentionally, as it turns out. ;-)

So the goal is that either test_bit(PG_waiter) sees the set_bit() or test_bit(PG_locked) sees the clear_bit(), correct?

> > As far as I know, all architectures fully order aligned same-size
> > machine-sized accesses to the same location even without barriers.
> > In the example above, the PG_locked and PG_waiter are different bits in
> > the same location, correct?
> > (Looks that way, but the above also looks
> > a bit abbreviated.)
>
> Correct, PG_* all live in the same word.

That should make things somewhat more reliable. ;-)

> > So unless they operate on the same location or are accompanied by
> > something like the smp_mb__after_atomic() called out above, there
> > is no ordering.
>
> Same word..
>
> > > So I think you're right and that we can forgo the memory barriers here.
> > > I even think this must be true on all architectures.
> > >
> > > Paul and Alan have a validation tool someplace, put them on Cc.
> >
> > It does not yet fully handle atomics (but maybe Alan is ahead of
> > me here, in which case he won't be shy). However, the point about
> > strong ordering of same-sized aligned accesses to a machine-sized
> > location can be made without atomics:
>
> Great. That's what I remember from reading that stuff.

;-)

Thanx, Paul
* Re: page_waitqueue() considered harmful
2016-09-28 7:05 ` Peter Zijlstra 2016-09-28 11:05 ` Paul E. McKenney @ 2016-09-29 1:31 ` Nicholas Piggin 2016-09-29 2:12 ` Paul E. McKenney 2016-09-29 6:21 ` Peter Zijlstra
1 sibling, 2 replies; 42+ messages in thread
From: Nicholas Piggin @ 2016-09-29 1:31 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Mel Gorman, Linus Torvalds, Andrew Morton, Kirill A. Shutemov, Johannes Weiner, Jan Kara, Rik van Riel, linux-mm, Will Deacon, Paul McKenney, Alan Stern

On Wed, 28 Sep 2016 09:05:46 +0200 Peter Zijlstra <peterz@infradead.org> wrote:

> On Wed, Sep 28, 2016 at 03:06:21AM +1000, Nicholas Piggin wrote:
> > On Tue, 27 Sep 2016 18:52:21 +0200
> > Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > > On Wed, Sep 28, 2016 at 12:53:18AM +1000, Nicholas Piggin wrote:
> > > > The more interesting is the ability to avoid the barrier between fastpath
> > > > clearing a bit and testing for waiters.
> > > >
> > > > unlock():                 lock() (slowpath):
> > > > clear_bit(PG_locked)      set_bit(PG_waiter)
> > > > test_bit(PG_waiter)      test_bit(PG_locked)
> > > >
> > > > If this was memory ops to different words, it would require smp_mb each
> > > > side.. Being the same word, can we avoid them?
> > >
> > > Ah, that is the reason I put that smp_mb__after_atomic() there. You have
> > > a cute point on them being to the same word though. Need to think about
> > > that.
> >
> > This is all assuming the store accesses are ordered, which you should get
> > if the stores to the different bits operate on the same address and size.
> > That might not be the case for some architectures, but they might not
> > require barriers for other reasons. That would call for an smp_mb variant
> > that is used for bitops on different bits but same aligned long.
>
> Since the {set,clear}_bit operations are atomic, they must be ordered
> against one another.
> The subsequent test_bit is a load, which, since it's
> to the same variable, and a CPU must appear to preserve Program-Order,
> must come after the RmW.
>
> So I think you're right and that we can forgo the memory barriers here.
> I even think this must be true on all architectures.

In generic code, I don't think so. We'd need an smp_mb__between_bitops_to_the_same_aligned_long, wouldn't we?

x86 implements set_bit as 'orb (addr),bit_nr', and the compiler could implement test_bit as a byte load as well. If those bits are in different bytes, then they could be reordered, no?

ia64 does 32-bit ops. If you make PG_waiter 64-bit only and put it in the other 32-bit half of the long, then this could be a problem too.
* Re: page_waitqueue() considered harmful
2016-09-29 1:31 ` Nicholas Piggin @ 2016-09-29 2:12 ` Paul E. McKenney 2016-09-29 6:21 ` Peter Zijlstra
0 siblings, 0 replies; 42+ messages in thread
From: Paul E. McKenney @ 2016-09-29 2:12 UTC (permalink / raw)
To: Nicholas Piggin
Cc: Peter Zijlstra, Mel Gorman, Linus Torvalds, Andrew Morton, Kirill A. Shutemov, Johannes Weiner, Jan Kara, Rik van Riel, linux-mm, Will Deacon, Alan Stern

On Thu, Sep 29, 2016 at 11:31:32AM +1000, Nicholas Piggin wrote:
> On Wed, 28 Sep 2016 09:05:46 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
>
> > On Wed, Sep 28, 2016 at 03:06:21AM +1000, Nicholas Piggin wrote:
> > > On Tue, 27 Sep 2016 18:52:21 +0200
> > > Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > > On Wed, Sep 28, 2016 at 12:53:18AM +1000, Nicholas Piggin wrote:
> > > > > The more interesting is the ability to avoid the barrier between fastpath
> > > > > clearing a bit and testing for waiters.
> > > > >
> > > > > unlock():                 lock() (slowpath):
> > > > > clear_bit(PG_locked)      set_bit(PG_waiter)
> > > > > test_bit(PG_waiter)      test_bit(PG_locked)
> > > > >
> > > > > If this was memory ops to different words, it would require smp_mb each
> > > > > side.. Being the same word, can we avoid them?
> > > >
> > > > Ah, that is the reason I put that smp_mb__after_atomic() there. You have
> > > > a cute point on them being to the same word though. Need to think about
> > > > that.
> > >
> > > This is all assuming the store accesses are ordered, which you should get
> > > if the stores to the different bits operate on the same address and size.
> > > That might not be the case for some architectures, but they might not
> > > require barriers for other reasons. That would call for an smp_mb variant
> > > that is used for bitops on different bits but same aligned long.
> >
> > Since the {set,clear}_bit operations are atomic, they must be ordered
> > against one another.
> > The subsequent test_bit is a load, which, since it's
> > to the same variable, and a CPU must appear to preserve Program-Order,
> > must come after the RmW.
> >
> > So I think you're right and that we can forgo the memory barriers here.
> > I even think this must be true on all architectures.
>
> In generic code, I don't think so. We'd need an
> smp_mb__between_bitops_to_the_same_aligned_long, wouldn't we?
>
> x86 implements set_bit as 'orb (addr),bit_nr', and the compiler could
> implement test_bit as a byte load as well. If those bits are in
> different bytes, then they could be reordered, no?
>
> ia64 does 32-bit ops. If you make PG_waiter 64-bit only and put it
> in the other 32-bit half of the long, then this could be a problem too.

Fair point, that would defeat the same-location ordering...

Thanx, Paul
* Re: page_waitqueue() considered harmful
2016-09-29 1:31 ` Nicholas Piggin 2016-09-29 2:12 ` Paul E. McKenney @ 2016-09-29 6:21 ` Peter Zijlstra 2016-09-29 6:42 ` Nicholas Piggin
1 sibling, 1 reply; 42+ messages in thread
From: Peter Zijlstra @ 2016-09-29 6:21 UTC (permalink / raw)
To: Nicholas Piggin
Cc: Mel Gorman, Linus Torvalds, Andrew Morton, Kirill A. Shutemov, Johannes Weiner, Jan Kara, Rik van Riel, linux-mm, Will Deacon, Paul McKenney, Alan Stern

On Thu, Sep 29, 2016 at 11:31:32AM +1000, Nicholas Piggin wrote:
> > Since the {set,clear}_bit operations are atomic, they must be ordered
> > against one another. The subsequent test_bit is a load, which, since it's
> > to the same variable, and a CPU must appear to preserve Program-Order,
> > must come after the RmW.
> >
> > So I think you're right and that we can forgo the memory barriers here.
> > I even think this must be true on all architectures.
>
> In generic code, I don't think so. We'd need an
> smp_mb__between_bitops_to_the_same_aligned_long, wouldn't we?
>
> x86 implements set_bit as 'orb (addr),bit_nr', and the compiler could
> implement test_bit as a byte load as well. If those bits are in
> different bytes, then they could be reordered, no?
>
> ia64 does 32-bit ops. If you make PG_waiter 64-bit only and put it
> in the other 32-bit half of the long, then this could be a problem too.

Not on ia64, its atomics are full barriers too, just like x86 (even though its docs imply otherwise). But I get the point.

I would however rather audit and attempt to fix affected archs before introducing such a barrier if at all possible.
* Re: page_waitqueue() considered harmful
2016-09-29 6:21 ` Peter Zijlstra @ 2016-09-29 6:42 ` Nicholas Piggin 2016-09-29 7:14 ` Peter Zijlstra
0 siblings, 1 reply; 42+ messages in thread
From: Nicholas Piggin @ 2016-09-29 6:42 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Mel Gorman, Linus Torvalds, Andrew Morton, Kirill A. Shutemov, Johannes Weiner, Jan Kara, Rik van Riel, linux-mm, Will Deacon, Paul McKenney, Alan Stern

On Thu, 29 Sep 2016 08:21:32 +0200 Peter Zijlstra <peterz@infradead.org> wrote:

> On Thu, Sep 29, 2016 at 11:31:32AM +1000, Nicholas Piggin wrote:
> > > Since the {set,clear}_bit operations are atomic, they must be ordered
> > > against one another. The subsequent test_bit is a load, which, since it's
> > > to the same variable, and a CPU must appear to preserve Program-Order,
> > > must come after the RmW.
> > >
> > > So I think you're right and that we can forgo the memory barriers here.
> > > I even think this must be true on all architectures.
> >
> > In generic code, I don't think so. We'd need an
> > smp_mb__between_bitops_to_the_same_aligned_long, wouldn't we?
> >
> > x86 implements set_bit as 'orb (addr),bit_nr', and the compiler could
> > implement test_bit as a byte load as well. If those bits are in
> > different bytes, then they could be reordered, no?
> >
> > ia64 does 32-bit ops. If you make PG_waiter 64-bit only and put it
> > in the other 32-bit half of the long, then this could be a problem too.
>
> Not on ia64, its atomics are full barriers too, just like x86 (even
> though its docs imply otherwise). But I get the point.

Oh yes of course I knew x86 atomics were barriers :) Take Alpha instead. It's using 32-bit ops.

> I would however rather audit and attempt to fix affected archs before
> introducing such a barrier if at all possible.
* Re: page_waitqueue() considered harmful
2016-09-29 6:42 ` Nicholas Piggin @ 2016-09-29 7:14 ` Peter Zijlstra 2016-09-29 7:43 ` Peter Zijlstra
0 siblings, 1 reply; 42+ messages in thread
From: Peter Zijlstra @ 2016-09-29 7:14 UTC (permalink / raw)
To: Nicholas Piggin
Cc: Mel Gorman, Linus Torvalds, Andrew Morton, Kirill A. Shutemov, Johannes Weiner, Jan Kara, Rik van Riel, linux-mm, Will Deacon, Paul McKenney, Alan Stern

On Thu, Sep 29, 2016 at 04:42:31PM +1000, Nicholas Piggin wrote:
> Take Alpha instead. It's using 32-bit ops.

Hmm, my Alpha docs are on the other machine, but I suppose the problem is 64bit immediates (which would be a common problem I suppose, those don't really work well on x86 either).

Yes, that does make it all more tricky than desired.

OTOH maybe this is a good excuse to (finally) sanitize the bitmap API to use a fixed width. Its use of 'unsigned long' has been something of a pain at times. With that it would be 'obvious' bits need to be part of the same 32bit word and we can use the normal smp_mb__{before,after}_atomic() again.

Given our bitops really are only about single bits, I can't see the change making a real performance difference. OTOH, the non atomic things like weight and ff[sz] do benefit from using the longer words. Bother..
* Re: page_waitqueue() considered harmful
2016-09-29 7:14 ` Peter Zijlstra @ 2016-09-29 7:43 ` Peter Zijlstra
0 siblings, 0 replies; 42+ messages in thread
From: Peter Zijlstra @ 2016-09-29 7:43 UTC (permalink / raw)
To: Nicholas Piggin
Cc: Mel Gorman, Linus Torvalds, Andrew Morton, Kirill A. Shutemov, Johannes Weiner, Jan Kara, Rik van Riel, linux-mm, Will Deacon, Paul McKenney, Alan Stern

On Thu, Sep 29, 2016 at 09:14:51AM +0200, Peter Zijlstra wrote:
> On Thu, Sep 29, 2016 at 04:42:31PM +1000, Nicholas Piggin wrote:
> > Take Alpha instead. It's using 32-bit ops.
>
> Hmm, my Alpha docs are on the other machine, but I suppose the problem
> is 64bit immediates (which would be a common problem I suppose, those
> don't really work well on x86 either).

OK, so from the architectures that have 64bit support, I think Alpha is the only one that uses 32bit ops and cares.

alpha   is weak and uses 32bit ops (fail)
arm64   is weak but uses 64bit ops
ia64    has full barriers and 64bit ops
mips    is weak but uses 64bit ops
parisc  is horrid but uses 64bit ops
powerpc is weak but uses 64bit ops
s390    has full barriers and uses 64bit ops
sparc   has full barriers and uses 64bit ops (if I read the asm right)
tile    is weak but uses 64bit ops
x86     has full barriers and uses byte ops

So we could just fix Alpha..
* Re: page_waitqueue() considered harmful
2016-09-27 14:53 ` Nicholas Piggin 2016-09-27 15:17 ` Nicholas Piggin 2016-09-27 16:52 ` Peter Zijlstra @ 2016-09-28 7:40 ` Mel Gorman
2 siblings, 0 replies; 42+ messages in thread
From: Mel Gorman @ 2016-09-28 7:40 UTC (permalink / raw)
To: Nicholas Piggin
Cc: Linus Torvalds, Andrew Morton, Kirill A. Shutemov, Johannes Weiner, Jan Kara, Peter Zijlstra, Rik van Riel, linux-mm

On Wed, Sep 28, 2016 at 12:53:18AM +1000, Nicholas Piggin wrote:
> > The patch is heavily derived from work by Nick Piggin who noticed the years
> > before that. I think that was the last version I posted and the changelog
> > includes profile data. I don't have an exact reference why it was rejected
> > but a consistent piece of feedback was that it was very complex for the
> > level of impact it had.
>
> Huh, I was just wondering about this again the other day. Powerpc has
> some interesting issues with atomic ops and barriers (not to mention
> random cache misses that hurt everybody).
>
> It actually wasn't for big Altix machines (at least not when I wrote it),
> but it came from some effort to optimize page reclaim performance on an
> opteron with a lot (back then) of cores per node.

The reason SUSE carried the patch for a while, albeit not right now, was the large machines, and them being the only people that could provide concrete support. Over time, the benefit dropped. It was always the case that the benefit of the patch could be measured from profiles but was rarely visible in the context of the overall workload, or was buried deep within the noise. The final structure of the patch was partially driven by fixes for subtle bugs discovered in the patch over a long period of time. Maybe the new approaches will both avoid those bugs and be visible on "real" workloads.

--
Mel Gorman
SUSE Labs
end of thread, other threads:[~2016-10-03 10:47 UTC | newest]

Thread overview: 42+ messages:
2016-09-26 20:58 page_waitqueue() considered harmful Linus Torvalds
2016-09-26 21:23 ` Rik van Riel
2016-09-26 21:30 ` Linus Torvalds
2016-09-26 23:11 ` Kirill A. Shutemov
2016-09-27  1:01 ` Rik van Riel
2016-09-27  7:30 ` Peter Zijlstra
2016-09-27  8:54 ` Mel Gorman
2016-09-27  9:11 ` Kirill A. Shutemov
2016-09-27  9:42 ` Mel Gorman
2016-09-27  9:52 ` Minchan Kim
2016-09-27 12:11 ` Kirill A. Shutemov
2016-09-29  8:01 ` Peter Zijlstra
2016-09-29 12:55 ` Nicholas Piggin
2016-09-29 13:16 ` Peter Zijlstra
2016-09-29 13:54 ` Nicholas Piggin
2016-09-29 15:05 ` Rik van Riel
2016-09-27  8:03 ` Jan Kara
2016-09-27  8:31 ` Mel Gorman
2016-09-27 14:34 ` Peter Zijlstra
2016-09-27 15:08 ` Nicholas Piggin
2016-09-27 16:31 ` Linus Torvalds
2016-09-27 16:49 ` Peter Zijlstra
2016-09-28 10:45 ` Mel Gorman
2016-09-28 11:11 ` Peter Zijlstra
2016-09-28 16:10 ` Linus Torvalds
2016-09-29 13:08 ` Peter Zijlstra
2016-10-03 10:47 ` Mel Gorman
2016-09-27 14:53 ` Nicholas Piggin
2016-09-27 15:17 ` Nicholas Piggin
2016-09-27 16:52 ` Peter Zijlstra
2016-09-27 17:06 ` Nicholas Piggin
2016-09-28  7:05 ` Peter Zijlstra
2016-09-28 11:05 ` Paul E. McKenney
2016-09-28 11:16 ` Peter Zijlstra
2016-09-28 12:58 ` Paul E. McKenney
2016-09-29  1:31 ` Nicholas Piggin
2016-09-29  2:12 ` Paul E. McKenney
2016-09-29  6:21 ` Peter Zijlstra
2016-09-29  6:42 ` Nicholas Piggin
2016-09-29  7:14 ` Peter Zijlstra
2016-09-29  7:43 ` Peter Zijlstra
2016-09-28  7:40 ` Mel Gorman