linux-fsdevel.vger.kernel.org archive mirror
* [RFC][PATCH 0/8] drop the mmap_sem when doing IO in the fault path
@ 2018-09-25 15:30 Josef Bacik
  2018-09-25 15:30 ` [PATCH 1/8] mm: push vm_fault into the page fault handlers Josef Bacik
                   ` (7 more replies)
  0 siblings, 8 replies; 12+ messages in thread
From: Josef Bacik @ 2018-09-25 15:30 UTC (permalink / raw)
  To: akpm, linux-kernel, kernel-team, linux-btrfs, riel, hannes, tj,
	linux-mm, linux-fsdevel

Now that we have proper isolation in place with cgroups2 we have started going
through and fixing the various priority inversions.  Most are gone now, but
this one is sort of weird since it's not necessarily a priority inversion that
happens within the kernel, but rather because of something userspace does.

We have giant applications that we want to protect, and parts of these giant
applications do things like watch the system state to determine how healthy the
box is for load balancing and such.  This involves running 'ps' or other such
utilities.  These utilities will often walk /proc/<pid>/whatever, and these
files can sometimes need to down_read(&task->mmap_sem).  Not usually a big deal,
but we noticed when we are stress testing that sometimes our protected
application has latency spikes trying to get the mmap_sem for tasks that are in
lower priority cgroups.

This is because any pending down_write() on an rw_semaphore essentially turns
it into a mutex: even if it is currently held for reading, new readers are not
let in, to keep from starving the writer.  This is fine, except a
lower priority task could be stuck doing IO because it has been throttled to the
point that its IO is taking much longer than normal.  But because a higher
priority group depends on this completing it is now stuck behind lower priority
work.
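
A minimal userspace sketch of the behaviour described above, assuming a toy
single-threaded model rather than the kernel's actual rw_semaphore.  All
toy_* names are hypothetical and exist only for illustration.  It shows that
once a writer is queued, a new reader is refused even though the lock is only
held for reading.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of a fair reader/writer semaphore (NOT the kernel's rwsem). */
struct toy_rwsem {
	int readers;         /* readers currently holding the lock */
	bool writer_active;  /* a writer currently holds the lock */
	int writers_waiting; /* writers queued behind the current holders */
};

/* A new reader gets in only if no writer holds the lock or is waiting. */
static bool toy_down_read_trylock(struct toy_rwsem *sem)
{
	if (sem->writer_active || sem->writers_waiting > 0)
		return false;
	sem->readers++;
	return true;
}

static void toy_up_read(struct toy_rwsem *sem)
{
	assert(sem->readers > 0);
	sem->readers--;
}

/* A writer takes the lock if it is free, otherwise it queues.
 * Returns true if the lock was acquired immediately. */
static bool toy_down_write_or_queue(struct toy_rwsem *sem)
{
	if (sem->readers == 0 && !sem->writer_active) {
		sem->writer_active = true;
		return true;
	}
	sem->writers_waiting++;
	return false;
}
```

In this model the first reader stands for the throttled low priority task
faulting with the lock held, the queued writer for something like an mmap()
in the same process, and the refused reader for 'ps' run from the protected
cgroup.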

In order to avoid this particular priority inversion we want to use the existing
retry mechanism to avoid holding the mmap_sem at all if we are going to do
IO.  This already exists, sort of, in the read case, but needed to be extended for
more than just grabbing the page lock.  With io.latency we throttle at
submit_bio() time, so the readahead stuff can block and even page_cache_read can
block, so all these paths need to have the mmap_sem dropped.
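
The retry idea can be sketched in userspace as follows.  This is a toy model
under stated assumptions, not the code in the patches: the toy_* types and
functions are hypothetical, and the lock is reduced to a reader count.  In
the real kernel the "try again" signal is VM_FAULT_RETRY.  The sketch shows
the handler dropping the lock before any IO and the caller retaking it and
retrying.

```c
#include <assert.h>
#include <stdbool.h>

enum toy_fault_result { TOY_FAULT_DONE, TOY_FAULT_RETRY };

struct toy_mm { int mmap_sem_readers; };  /* stand-in for mmap_sem */
struct toy_page { bool uptodate; };

/* Possibly throttled IO: must never run with the lock held. */
static void toy_do_io(struct toy_mm *mm, struct toy_page *page)
{
	assert(mm->mmap_sem_readers == 0);
	page->uptodate = true;
}

/* Called with the lock held for read.  If IO is needed, the lock is
 * released first and TOY_FAULT_RETRY tells the caller to start over. */
static enum toy_fault_result toy_handle_fault(struct toy_mm *mm,
					      struct toy_page *page)
{
	if (page->uptodate)
		return TOY_FAULT_DONE;
	mm->mmap_sem_readers--;		/* models up_read(&mm->mmap_sem) */
	toy_do_io(mm, page);
	return TOY_FAULT_RETRY;
}

/* Fault entry point: returns how many passes the fault needed. */
static int toy_fault_loop(struct toy_mm *mm, struct toy_page *page)
{
	int passes = 0;

	for (;;) {
		mm->mmap_sem_readers++;	/* models down_read(&mm->mmap_sem) */
		passes++;
		if (toy_handle_fault(mm, page) == TOY_FAULT_DONE) {
			mm->mmap_sem_readers--;
			return passes;
		}
		/* RETRY: the handler already dropped the lock for us. */
	}
}
```

A fault on a page that needs IO takes two passes in this model, and the lock
is not held while the IO runs; a fault on an up-to-date page completes in one
pass.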

The other big thing is ->page_mkwrite.  btrfs is particularly shitty here
because we have to reserve space for the dirty page, which can be a very
expensive operation.  We use the same retry method as the read path, and simply
cache the page and verify the page is still set up properly the next pass through
->page_mkwrite().
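
A rough sketch of the "cache the page, then verify on the next pass" pattern,
again as a hypothetical userspace model rather than the btrfs change itself.
The toy_* names, the integer mapping_id standing in for page->mapping, and
the result codes are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_wpage {
	int mapping_id;       /* stand-in for page->mapping; 0 = truncated */
	bool space_reserved;  /* result of the expensive reservation */
};

struct toy_wfault {
	struct toy_wpage *cached_page;  /* page remembered from the last pass */
	int expected_mapping;           /* mapping the fault belongs to */
};

enum toy_mkwrite_result {
	TOY_MKWRITE_LOCKED,  /* page is valid and ready to be dirtied */
	TOY_MKWRITE_RETRY,   /* lock was dropped, caller must retry */
	TOY_MKWRITE_NOPAGE,  /* page went away while the lock was dropped */
};

static enum toy_mkwrite_result toy_page_mkwrite(struct toy_wfault *vmf,
						struct toy_wpage *page)
{
	if (vmf->cached_page != page) {
		/* First pass: do the expensive reservation (conceptually
		 * with mmap_sem dropped), remember the page, ask for retry. */
		page->space_reserved = true;
		vmf->cached_page = page;
		return TOY_MKWRITE_RETRY;
	}

	/* Second pass: the page may have been truncated or remapped while
	 * the lock was dropped, so recheck before trusting the cache. */
	vmf->cached_page = NULL;
	if (page->mapping_id != vmf->expected_mapping) {
		page->space_reserved = false;  /* give the reservation back */
		return TOY_MKWRITE_NOPAGE;
	}
	return TOY_MKWRITE_LOCKED;
}
```

The second pass is cheap in the common case because the reservation is
already in place; the recheck only matters when the page changed underneath
the fault.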

I've tested these patches with xfstests and there are no regressions.  Let me
know what you think.  Thanks,

Josef

Thread overview: 12+ messages
2018-09-25 15:30 [RFC][PATCH 0/8] drop the mmap_sem when doing IO in the fault path Josef Bacik
2018-09-25 15:30 ` [PATCH 1/8] mm: push vm_fault into the page fault handlers Josef Bacik
2018-09-26  0:22   ` Dave Chinner
2018-09-26  1:33     ` Josef Bacik
2018-09-25 15:30 ` [PATCH 2/8] mm: drop mmap_sem for page cache read IO submission Josef Bacik
2018-09-25 15:30 ` [PATCH 3/8] mm: clean up swapcache lookup and creation function names Josef Bacik
2018-09-25 15:30 ` [PATCH 4/8] mm: drop mmap_sem for swap read IO submission Josef Bacik
2018-09-25 15:30 ` [PATCH 5/8] mm: drop the mmap_sem in all read fault cases Josef Bacik
2018-09-25 15:30 ` [PATCH 6/8] mm: keep the page we read for the next loop Josef Bacik
2018-09-25 15:30 ` [PATCH 7/8] mm: add a flag to indicate we used a cached page Josef Bacik
2018-09-25 15:30 ` [PATCH 8/8] btrfs: drop mmap_sem in mkwrite for btrfs Josef Bacik
2018-09-26  0:24   ` Dave Chinner
