From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <3D2DEDAD.A38AFF25@zip.com.au>
Date: Thu, 11 Jul 2002 13:42:21 -0700
From: Andrew Morton
MIME-Version: 1.0
Subject: Re: [PATCH] Optimize out pte_chain take three
References: <3D2DE264.17706BB4@zip.com.au>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
Return-Path:
To: Rik van Riel
Cc: William Lee Irwin III, Dave McCracken, Linux Memory Management
List-ID:

Rik van Riel wrote:
> > ...
> >
> > useful pagecache and swapping everything out.  Our kernels have
> > O_STREAMING because of this.  It simply removes as much pagecache
> > as it can, each time ->nrpages reaches 256.  It's rather effective.
>
> Now why does that remind me of drop-behind ? ;)

I looked at 2.4-ac as well.  Seems the drop-behind there only
addresses reads?

This is a specialised application and, frankly, I don't think magical
voodoo kernel logic will ever work as well as exposing the capability
to the application.

The posix_fadvise() API is basically ideal for this (application-side
sketch at the end of this mail), but it's quite hard for Linux to
implement efficiently.  How do we efficiently discard the 10,000
pages starting at page offset 25,000,000?  We can do that in
O(not much) time with

	radix_tree_gang_lookup(void **pointers, int how_many,
			       int starting_offset)

but that hasn't been written (usage sketch below).  It would make
truncate/invalidate_inode_pages tons faster and cleaner too.

> > I installed 2.5.25+rmap on my desktop yesterday.  Came in this
> > morning to discover half of memory is inodes, a quarter of memory
> > is dentries, and I'm 40 megs into swap.  Sigh.
>
> As requested by Linus, this patch only has the mechanism
> and none of the balancing changes.
>
> I suspect Ed Tomlinson's patch will fix this issue.

yup.

btw, I was looking into many-spindle writeback performance yesterday.
It's pretty bad.  The test case is simply four disks, four ext2
filesystems, and four processes, each writing flat-out to its own
disk.  Throughput is only 60% of O_DIRECT, because one disk's queue
fills up and everybody ends up blocking on that queue.

2.4 has the same problem, and it's basically unsolvable there because
of the global buffer LRU.

In 2.5, the balance_dirty() path is trivially solved by making the
caller of balance_dirty_pages() write back only data against the
superblock which he just dirtied (sketch below).  However, unless I
set the dirty memory thresholds super-low, so that in fact none of
the queues ever fills, we still hit the same interqueue contention in
the page reclaim code.

I was scratching my head over this for some time: how come there are
dirty pages at the tail of the LRU when the inactive list is quite
enormous?

I need to confirm this, but I suspect it's metadata: we're moving
pages to the head of the LRU when they are first added to the inode,
and when writeback is started.  But we're *not* performing that
motion when the fs does mark_buffer_dirty(bitmap block), for example.
So that dirty-against-a-full-queue bitmap block is a little timebomb,
worming its way to the tail of the LRU.

Probably a touch_buffer() in mark_buffer_dirty() would plug this
(sketch below), but that's even more atomic operations, even more
banging on the pagemap_lru_lock.

I suspect the best fix here is to not have dirty or writeback
pagecache pages on the LRU at all.  Throttle on memory becoming
reclaimable, put the pages back on the LRU when they're clean, etc.
As we have often discussed.  Big change.
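
Some sketches of the above, for the archives (all untested).

First, the application side of the fadvise approach: a minimal
drop-behind writer, assuming 4k pages and a 256-page batch like
O_STREAMING.  Note that POSIX_FADV_DONTNEED can't evict pages which
are still dirty, so a real streamer would have to start writeback
(fdatasync or similar) before dropping:

	#define _XOPEN_SOURCE 600	/* for posix_fadvise() */
	#include <fcntl.h>
	#include <unistd.h>

	#define DROP_BATCH	(256 * 4096)	/* ->nrpages == 256 */

	static void stream_out(int fd, const char *buf, size_t len)
	{
		off_t done = 0, dropped = 0;

		while (done < (off_t)len) {
			ssize_t n = write(fd, buf + done, len - done);

			if (n <= 0)
				break;
			done += n;
			if (done - dropped >= DROP_BATCH) {
				/* kick the written region out of pagecache */
				posix_fadvise(fd, dropped, done - dropped,
						POSIX_FADV_DONTNEED);
				dropped = done;
			}
		}
	}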
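
Next, how invalidate might use the gang lookup.  Everything here is
hypothetical - the primitive doesn't exist yet, it would presumably
also need the tree root passed in, and the locking and page state
checks are elided:

	#define GANG_BATCH	16

	static void gang_invalidate(struct address_space *mapping,
				    unsigned long start, unsigned long nr)
	{
		struct page *pages[GANG_BATCH];
		int found, i;

		while (nr) {
			int want = nr < GANG_BATCH ? nr : GANG_BATCH;

			found = radix_tree_gang_lookup(&mapping->page_tree,
					(void **)pages, want, start);
			if (!found)
				break;	/* hole: no more pages */
			for (i = 0; i < found; i++) {
				start = pages[i]->index + 1;
				/* ...unmap, remove from pagecache, free... */
			}
			nr -= found;
		}
	}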
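
The per-superblock writeback change is basically this (the threshold
test and writeback helper are made-up names; the point is only that
the caller pushes back its own superblock's dirty data instead of
blocking against whichever queue happens to be full):

	/* sketch only */
	void balance_dirty_pages(struct address_space *mapping)
	{
		if (dirty_memory_over_threshold()) {
			/*
			 * Write back dirty data belonging to the fs
			 * which this caller just dirtied - don't go
			 * and block on someone else's full queue.
			 */
			writeback_superblock(mapping->host->i_sb);
		}
	}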
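
And the touch_buffer() band-aid, against 2.4-style buffer code (the
surrounding mark_buffer_dirty() body is from memory):

	void mark_buffer_dirty(struct buffer_head *bh)
	{
		if (!atomic_set_buffer_dirty(bh)) {
			/*
			 * Promote the backing page so a dirty bitmap
			 * block doesn't age to the tail of the LRU.
			 * Costs another atomic op per call.
			 */
			touch_buffer(bh);
			__mark_dirty(bh);
		}
	}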
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/