On Thu, Aug 11, 2016 at 09:55:33AM -0700, Linus Torvalds wrote:
> On Thu, Aug 11, 2016 at 8:57 AM, Christoph Hellwig wrote:
> >
> > The one liner below (not tested yet) to simply remove it should fix that
> > up.  I also noticed we have a spurious pagefault_disable/enable, I
> > need to dig into the history of that first, though.
>
> Hopefully the pagefault_disable/enable doesn't matter for this case.
>
> Can we get this one-liner tested with the kernel robot for comparison?
> I really think a messed-up LRU list could cause bad IO patterns, and
> end up keeping dirty pages around that should be streaming out to disk
> and re-used, so causing memory pressure etc for no good reason.
>
> I think the mapping->tree_lock issue that Dave sees is interesting
> too, but the kswapd activity (and the extra locking it causes) could
> also be a symptom of the same thing - memory pressure due to just
> putting pages in the active file that simply shouldn't be there.

So, removing mark_page_accessed() made the spinlock contention *worse*.

  36.51%  [kernel]  [k] _raw_spin_unlock_irqrestore
   6.27%  [kernel]  [k] copy_user_generic_string
   3.73%  [kernel]  [k] _raw_spin_unlock_irq
   3.55%  [kernel]  [k] get_page_from_freelist
   1.97%  [kernel]  [k] do_raw_spin_lock
   1.72%  [kernel]  [k] __block_commit_write.isra.30
   1.44%  [kernel]  [k] __wake_up_bit
   1.41%  [kernel]  [k] shrink_page_list
   1.24%  [kernel]  [k] __radix_tree_lookup
   1.03%  [kernel]  [k] xfs_log_commit_cil
   0.99%  [kernel]  [k] free_hot_cold_page
   0.96%  [kernel]  [k] end_buffer_async_write
   0.95%  [kernel]  [k] delay_tsc
   0.94%  [kernel]  [k] ___might_sleep
   0.93%  [kernel]  [k] kmem_cache_alloc
   0.90%  [kernel]  [k] unlock_page
   0.82%  [kernel]  [k] kmem_cache_free
   0.74%  [kernel]  [k] up_write
   0.72%  [kernel]  [k] node_dirty_ok
   0.66%  [kernel]  [k] clear_page_dirty_for_io
   0.65%  [kernel]  [k] __mark_inode_dirty
   0.64%  [kernel]  [k] __block_write_begin_int
   0.58%  [kernel]  [k] xfs_inode_item_format
   0.57%  [kernel]  [k] __memset
   0.57%  [kernel]  [k] cancel_dirty_page
   0.56%  [kernel]  [k] down_write
   0.54%  [kernel]  [k] page_evictable
   0.53%  [kernel]  [k] page_mapping
   0.52%  [kernel]  [k] __slab_free
   0.49%  [kernel]  [k] xfs_do_writepage
   0.49%  [kernel]  [k] drop_buffers

-  41.82%    41.82%  [kernel]  [k] _raw_spin_unlock_irqrestore
   - 35.93% ret_from_fork
      - kthread
         - 29.76% kswapd
              shrink_node
              shrink_node_memcg.isra.75
              shrink_inactive_list
              shrink_page_list
              __remove_mapping
              _raw_spin_unlock_irqrestore
         - 7.13% worker_thread
            - process_one_work
               - 4.40% wb_workfn
                    wb_writeback
                    __writeback_inodes_wb
                    writeback_sb_inodes
                    __writeback_single_inode
                    do_writepages
                    xfs_vm_writepages
                    write_cache_pages
                    xfs_do_writepage
               - 2.71% xfs_end_io
                    xfs_destroy_ioend
                    end_buffer_async_write
                    end_page_writeback
                    test_clear_page_writeback
                    _raw_spin_unlock_irqrestore
   + 4.88% __libc_pwrite

The kswapd contention has jumped from 20% to 30% of the CPU time in
the profiles. I can't see how changing what LRU the page is on will
improve the contention problem - at its source it's an N:1 problem
where the writing process and N kswapd worker threads are all trying
to access the same lock concurrently....

This is not the AIM7 problem we are looking for - what this test
demonstrates is a fundamental page cache scalability issue at the
design level - the mapping->tree_lock is a global serialisation
point....

I'm now going to test Christoph's theory that this is an "overwrite
doing lots of block mapping" issue. More on that to follow.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
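
A minimal userspace sketch of the N:1 serialisation shape described
above: one writer and a handful of "reclaim" threads all funnelling
through a single lock that stands in for mapping->tree_lock. The thread
count, loop sizes and names are invented for the illustration - this is
not the kernel code, only a model of the contention pattern. It builds
with something like "cc -O2 model.c -lpthread" (file name hypothetical).

/* Userspace model of the N:1 mapping->tree_lock contention pattern. */
#include <pthread.h>
#include <stdio.h>

#define NR_RECLAIM_THREADS 4       /* stands in for the kswapd threads */
#define OPS_PER_THREAD     1000000 /* arbitrary, just to generate load */

static pthread_spinlock_t tree_lock;  /* stands in for mapping->tree_lock */
static unsigned long nr_pages;        /* stands in for the radix tree */

/* The "writing process": inserts pages into the page cache. */
static void *writer(void *arg)
{
	for (long i = 0; i < OPS_PER_THREAD; i++) {
		pthread_spin_lock(&tree_lock);
		nr_pages++;
		pthread_spin_unlock(&tree_lock);
	}
	return NULL;
}

/* A "kswapd": removes pages again, like __remove_mapping() does. */
static void *reclaimer(void *arg)
{
	for (long i = 0; i < OPS_PER_THREAD; i++) {
		pthread_spin_lock(&tree_lock);
		if (nr_pages)
			nr_pages--;
		pthread_spin_unlock(&tree_lock);
	}
	return NULL;
}

int main(void)
{
	pthread_t w, r[NR_RECLAIM_THREADS];

	pthread_spin_init(&tree_lock, PTHREAD_PROCESS_PRIVATE);

	pthread_create(&w, NULL, writer, NULL);
	for (int i = 0; i < NR_RECLAIM_THREADS; i++)
		pthread_create(&r[i], NULL, reclaimer, NULL);

	pthread_join(w, NULL);
	for (int i = 0; i < NR_RECLAIM_THREADS; i++)
		pthread_join(r[i], NULL);

	printf("pages left in cache: %lu\n", nr_pages);
	return 0;
}

All N+1 threads serialise on the one lock, so adding more reclaim
threads only adds contention - the same shape as the profile above.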