From: Dave Chinner <david@fromorbit.com>
To: Mel Gorman <mgorman@suse.de>
Cc: jack@suse.cz, xfs@oss.sgi.com, Christoph Hellwig <hch@infradead.org>,
	linux-mm@kvack.org, Wu Fengguang <fengguang.wu@intel.com>,
	Johannes Weiner <jweiner@redhat.com>
Subject: Re: [PATCH 03/27] xfs: use write_cache_pages for writeback clustering
Date: Wed, 6 Jul 2011 01:55:42 +1000	[thread overview]
Message-ID: <20110705155542.GG1026@dastard> (raw)
In-Reply-To: <20110705141016.GA15285@suse.de>

On Tue, Jul 05, 2011 at 03:10:16PM +0100, Mel Gorman wrote:
> On Sat, Jul 02, 2011 at 12:42:19PM +1000, Dave Chinner wrote:
> > On Fri, Jul 01, 2011 at 03:59:35PM +0100, Mel Gorman wrote:
> > BTW, calling a workload "fsmark" tells us nothing about the workload
> > being tested - fsmark can do a lot of interesting things. IOWs, you
> > need to quote the command line for it to be meaningful to anyone...
>
> My bad.
>
> ./fs_mark -d /tmp/fsmark-14880 -D 225 -N 22500 -n 3125 -L 15 -t 16 -S0 -s 131072

Ok, so 16 threads, 3125 files per thread, 128k per file, all created
in the same directory, which rolls over when it gets to 22500 files
in the directory.

Yeah, it generates a bit of memory pressure, but I think the file
sizes are too small to really stress writeback much. You need to use
files that are at least 10MB in size to really start to mix up the
writeback lists and the way they juggle new and old inodes to try
not to starve any particular inode of writeback bandwidth....

Also, I don't use the "-t <num>" threading mechanism because all it
does is bash on the directory mutex without really improving
parallelism for creates. perf top on my system shows:

  samples  pcnt function                     DSO
  _______ _____ ____________________________ __________________
  2799.00  9.3% mutex_spin_on_owner          [kernel.kallsyms]
  2049.00  6.8% copy_user_generic_string     [kernel.kallsyms]
  1912.00  6.3% _raw_spin_unlock_irqrestore  [kernel.kallsyms]

A contended mutex as the prime CPU consumer.
That's more CPU than copying 750MB/s of data. Hence I normally drive
parallelism with fsmark by using multiple "-d <dir>" options, which
runs a thread per directory and a workload unit per directory, so you
don't get directory mutex contention causing serialisation and
interference with what you are really trying to measure....

> > > As I look through the results I have at the moment, the number of
> > > pages written back was simply really low which is why the problem
> > > fell off my radar.
> >
> > It doesn't take many to completely screw up writeback IO patterns.
> > Write a few random pages to a 10MB file well before writeback would
> > get to the file, and instead of getting optimal sequential writeback
> > patterns when writeback gets to it, we get multiple disjoint IOs
> > that require multiple seeks to complete.
> >
> > Slower, less efficient writeback IO causes memory pressure to last
> > longer and hence is more likely to result in kswapd writeback, and
> > it's just a downward spiral from there....
>
> Yes, I see the negative feedback loop. This has always been a struggle
> in that kswapd needs pages from a particular zone to be cleaned and
> freed but calling writepage can make things slower. There were
> prototypes in the past to give hints to the flusher threads on which
> inodes and pages to free, and they were never met with any degree of
> satisfaction.
>
> The consensus (among VM people at least) was that as long as that
> number was low, it wasn't much of a problem.

Therein lies the problem. You've got storage people telling you
there is an IO problem with memory reclaim, but the mm community
then puts their heads together somewhere private, decides it isn't a
problem worth fixing, and does nothing. Rinse, lather, repeat.

I expect memory reclaim to play nicely with writeback that is
already in progress.
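Going back to the fs_mark invocation above: the multiple "-d <dir>"
approach can be sketched as a small wrapper like the one below. The
mount point, the worker count of four, and the use of echo (so the
command can be inspected before running) are illustrative choices,
not from the original mail; the option values are Mel's.

```shell
# Sketch: drive fs_mark parallelism with one "-d <dir>" per worker
# instead of "-t <num>".  fs_mark runs a thread per directory, so the
# workers no longer serialise on a single directory mutex.  Paths and
# the worker count are assumptions for illustration.
DIRS=""
for i in 1 2 3 4; do
    DIRS="$DIRS -d /mnt/scratch/$i"
done
# echo first so the command line can be checked; drop it to run.
echo ./fs_mark $DIRS -D 225 -N 22500 -n 3125 -L 15 -S0 -s 131072
```

Note the "-t 16" from the original command line is gone: the thread
count now comes from the number of "-d" options.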
These subsystems do not work in isolation, yet memory reclaim treats
it that way - as though it is the most important IO submitter and
everything else can suffer while memory reclaim does its stuff.
Memory reclaim needs to co-ordinate with writeback effectively for
the system as a whole to work well.

> I know you disagree.

Right, that's because it doesn't have to be a very high number to be
a problem. IO is orders of magnitude slower than the CPU time it
takes to flush a page, so the cost of making a bad flush decision is
very high. And single page writeback from the LRU is almost always a
bad flush decision.

> > > > > Oh, now that is too close to just be a co-incidence. We're
> > > > > getting significant amounts of random page writeback from the
> > > > > ends of the LRUs done by the VM.
> > > > >
> > > > > <sigh>
> > >
> > > Does the value for nr_vmscan_write in /proc/vmstat correlate? It
> > > must, but let me be sure because I'm using that figure rather than
> > > ftrace to count writebacks at the moment.
> >
> > The number in /proc/vmstat is higher. Much higher. I just ran the
> > test at 1000 files (only collapsed to ~3000 iops this time because I
> > ran it on a plain 3.0-rc4 kernel that still has the .writepage
> > clustering in XFS), and I see:
> >
> > nr_vmscan_write 6723
> >
> > after the test. The event trace only captured ~1400 writepage events
> > from kswapd, but it tends to miss a lot of events as the system is
> > quite unresponsive at times under this workload - it's not uncommon
> > to have ssh sessions not echo a character for 10s... e.g: I started
> > the workload ~11:08:22:
>
> Ok, I'll be looking at nr_vmscan_write as the basis for "badness".

Perhaps you should look at my other reply (and two line "fix") in
the thread about stopping dirty page writeback until after waiting
on pages under writeback.....
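Since both sides are now using nr_vmscan_write as the badness metric,
a minimal way to measure it per run is a before/after delta - the
counter in /proc/vmstat is cumulative since boot. This is a sketch,
not from the mail; the VMSTAT override and the helper name are
assumptions, there only so the snippet can be pointed at a canned
file.

```shell
# Read one cumulative counter out of /proc/vmstat (or a substitute
# file named by $VMSTAT), then report how much it grew across a run.
vmstat_field() {
    awk -v f="$1" '$1 == f { print $2 }' "${VMSTAT:-/proc/vmstat}"
}

before=$(vmstat_field nr_vmscan_write); before=${before:-0}
# ... run the workload under test here ...
after=$(vmstat_field nr_vmscan_write); after=${after:-0}
echo "pages written from reclaim: $((after - before))"
```

Per Dave's numbers above, a delta of ~6723 across a short fsmark run
is the sort of figure that correlates with the degraded IO patterns.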
> > > A more relevant question is this - how many pages were reclaimed
> > > by kswapd and what percentage is 799 pages of that? What do you
> > > consider an acceptable percentage?
> >
> > I don't care what the percentage is or what the number is. kswapd is
> > reclaiming pages most of the time without affecting IO patterns, and
> > when that happens I just don't care because it is working just fine.
>
> I do care. I'm looking at some early XFS results here based on a laptop
> (4G). For fsmark with the command line above, the number of pages
> written back by kswapd was 0. The worst test by far was sysbench using
> a particularly large database. The number of writes was 48745 which is
> 0.27% of pages scanned or 0.28% of pages reclaimed. Ordinarily I would
> ignore that.
>
> If I run this at 1G and get a similar ratio, I will assume that I
> am not reproducing your problem at all unless I know what ratio you
> are seeing.

Single threaded writing of files should -never- cause writeback from
the LRUs. If that is happening, then the memory reclaim throttling
is broken. See my other email.

> So .... How many pages were reclaimed by kswapd and what percentage
> is 799 pages of that?

No idea. That information is long gone....

> You answered my second question. You consider 0% to be the acceptable
> percentage.

No, I expect memory reclaim to behave nicely with writeback that is
already in progress. These subsystems do not work in isolation -
they need to co-ordinate.

> > What I care about is what kswapd is doing when it finds dirty pages
> > and it decides they need to be written back. It's not a problem that
> > they are found or need to be written, the problem is the utterly
> > crap way that memory reclaim is throwing the pages at the filesystem.
> >
> > I'm not sure how to get through to you guys that single, random page
> > writeback is *BAD*.
>
> It got through.
> The feedback during discussions on the VM side was that as long as
> the percentage was sufficiently low it wasn't a problem because on
> occasion, the VM really needs pages from a particular zone. A
> solution that addressed both problems has never been agreed on, and
> energy and time run out before it gets fixed each time.

<sigh>

> > And while I'm ranting, when on earth is the issue-writeback-from-
> > direct-reclaim problem going to be fixed so we can remove the hacks
> > in the filesystem .writepage implementations to prevent this from
> > occurring?
>
> Prototyped that too, same thread. Same type of problem: writeback
> from direct reclaim should happen so rarely that it should not be
> optimised for. See https://lkml.org/lkml/2010/6/11/32

Writeback from direct reclaim crashes systems by causing stack
overruns - that's why we've disabled it. It's not an "optimisation"
problem - it's a _memory corruption_ bug that needs to be fixed.....

> At the risk of pissing you off, this isn't new information so I'll
> consider myself duly nudged into revisiting.

No, I've had a rant to express my displeasure at the lack of
progress on this front.

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs