[PATCH RESEND] block: annotate refault stalls from IO submission

linux-btrfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH RESEND] block: annotate refault stalls from IO submission
@ 2019-08-08 19:03 Johannes Weiner
  2019-08-09 22:12 ` Dave Chinner
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Johannes Weiner @ 2019-08-08 19:03 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Dave Chinner, Andrew Morton, linux-mm, linux-btrfs, linux-ext4,
	linux-fsdevel, linux-block, linux-kernel

psi tracks the time tasks wait for refaulting pages to become
uptodate, but it does not track the time spent submitting the IO. The
submission part can be significant if backing storage is contended or
when cgroup throttling (io.latency) is in effect - a lot of time is
spent in submit_bio(). In that case, we underreport memory pressure.

Annotate submit_bio() to account submission time as memory stall when
the bio is reading userspace workingset pages.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 block/bio.c               |  3 +++
 block/blk-core.c          | 23 ++++++++++++++++++++++-
 include/linux/blk_types.h |  1 +
 3 files changed, 26 insertions(+), 1 deletion(-)

diff --git a/block/bio.c b/block/bio.c
index 299a0e7651ec..4196865dd300 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -806,6 +806,9 @@ void __bio_add_page(struct bio *bio, struct page *page,
 
 	bio->bi_iter.bi_size += len;
 	bio->bi_vcnt++;
+
+	if (!bio_flagged(bio, BIO_WORKINGSET) && unlikely(PageWorkingset(page)))
+		bio_set_flag(bio, BIO_WORKINGSET);
 }
 EXPORT_SYMBOL_GPL(__bio_add_page);
 
diff --git a/block/blk-core.c b/block/blk-core.c
index d0cc6e14d2f0..1b1705b7dde7 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -36,6 +36,7 @@
 #include <linux/blk-cgroup.h>
 #include <linux/debugfs.h>
 #include <linux/bpf.h>
+#include <linux/psi.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/block.h>
@@ -1128,6 +1129,10 @@ EXPORT_SYMBOL_GPL(direct_make_request);
  */
 blk_qc_t submit_bio(struct bio *bio)
 {
+	bool workingset_read = false;
+	unsigned long pflags;
+	blk_qc_t ret;
+
 	if (blkcg_punt_bio_submit(bio))
 		return BLK_QC_T_NONE;
 
@@ -1146,6 +1151,8 @@ blk_qc_t submit_bio(struct bio *bio)
 		if (op_is_write(bio_op(bio))) {
 			count_vm_events(PGPGOUT, count);
 		} else {
+			if (bio_flagged(bio, BIO_WORKINGSET))
+				workingset_read = true;
 			task_io_account_read(bio->bi_iter.bi_size);
 			count_vm_events(PGPGIN, count);
 		}
@@ -1160,7 +1167,21 @@ blk_qc_t submit_bio(struct bio *bio)
 		}
 	}
 
-	return generic_make_request(bio);
+	/*
+	 * If we're reading data that is part of the userspace
+	 * workingset, count submission time as memory stall. When the
+	 * device is congested, or the submitting cgroup IO-throttled,
+	 * submission can be a significant part of overall IO time.
+	 */
+	if (workingset_read)
+		psi_memstall_enter(&pflags);
+
+	ret = generic_make_request(bio);
+
+	if (workingset_read)
+		psi_memstall_leave(&pflags);
+
+	return ret;
 }
 EXPORT_SYMBOL(submit_bio);
 
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 1b1fa1557e68..a9dadfc16a92 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -209,6 +209,7 @@ enum {
 	BIO_BOUNCED,		/* bio is a bounce bio */
 	BIO_USER_MAPPED,	/* contains user pages */
 	BIO_NULL_MAPPED,	/* contains invalid user pages */
+	BIO_WORKINGSET,		/* contains userspace workingset pages */
 	BIO_QUIET,		/* Make BIO Quiet */
 	BIO_CHAIN,		/* chained bio, ->bi_remaining in effect */
 	BIO_REFFED,		/* bio has elevated ->bi_cnt */
-- 
2.22.0


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH RESEND] block: annotate refault stalls from IO submission
  2019-08-08 19:03 [PATCH RESEND] block: annotate refault stalls from IO submission Johannes Weiner
@ 2019-08-09 22:12 ` Dave Chinner
  2019-08-13 17:46   ` Johannes Weiner
  2019-08-09 23:03 ` Suren Baghdasaryan
  2019-08-14 14:50 ` Jens Axboe
  2 siblings, 1 reply; 7+ messages in thread
From: Dave Chinner @ 2019-08-09 22:12 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Jens Axboe, Andrew Morton, linux-mm, linux-btrfs, linux-ext4,
	linux-fsdevel, linux-block, linux-kernel

On Thu, Aug 08, 2019 at 03:03:00PM -0400, Johannes Weiner wrote:
> psi tracks the time tasks wait for refaulting pages to become
> uptodate, but it does not track the time spent submitting the IO. The
> submission part can be significant if backing storage is contended or
> when cgroup throttling (io.latency) is in effect - a lot of time is

Or the wbt is throttling.

> spent in submit_bio(). In that case, we underreport memory pressure.
> 
> Annotate submit_bio() to account submission time as memory stall when
> the bio is reading userspace workingset pages.

PAtch looks fine to me, but it raises another question w.r.t. IO
stalls and reclaim pressure feedback to the vm: how do we make use
of the pressure stall infrastructure to track inode cache pressure
and stalls?

With the congestion_wait() and wait_iff_congested() being entire
non-functional for block devices since 5.0, there is no IO load
based feedback going into memory reclaim from shrinkers that might
require IO to free objects before they can be reclaimed. This is
directly analogous to page reclaim writing back dirty pages from
the LRU, and as I understand it one of things the PSI is supposed
to be tracking.

Lots of workloads create inode cache pressure and often it can
dominate the time spent in memory reclaim, so it would seem to me
that having PSI only track/calculate pressure and stalls from LRU
pages misses a fair chunk of the memory pressure and reclaim stalls
that can be occurring.

Any thoughts of how we might be able to integrate more of the system
caches into the PSI infrastructure, Johannes?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH RESEND] block: annotate refault stalls from IO submission
  2019-08-08 19:03 [PATCH RESEND] block: annotate refault stalls from IO submission Johannes Weiner
  2019-08-09 22:12 ` Dave Chinner
@ 2019-08-09 23:03 ` Suren Baghdasaryan
  2019-08-14 14:50 ` Jens Axboe
  2 siblings, 0 replies; 7+ messages in thread
From: Suren Baghdasaryan @ 2019-08-09 23:03 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Jens Axboe, Dave Chinner, Andrew Morton, linux-mm, linux-btrfs,
	linux-ext4, linux-fsdevel, linux-block, LKML

On Thu, Aug 8, 2019 at 12:03 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> psi tracks the time tasks wait for refaulting pages to become
> uptodate, but it does not track the time spent submitting the IO. The
> submission part can be significant if backing storage is contended or
> when cgroup throttling (io.latency) is in effect - a lot of time is
> spent in submit_bio(). In that case, we underreport memory pressure.
>
> Annotate submit_bio() to account submission time as memory stall when
> the bio is reading userspace workingset pages.
>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  block/bio.c               |  3 +++
>  block/blk-core.c          | 23 ++++++++++++++++++++++-
>  include/linux/blk_types.h |  1 +
>  3 files changed, 26 insertions(+), 1 deletion(-)
>
> diff --git a/block/bio.c b/block/bio.c
> index 299a0e7651ec..4196865dd300 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -806,6 +806,9 @@ void __bio_add_page(struct bio *bio, struct page *page,
>
>         bio->bi_iter.bi_size += len;
>         bio->bi_vcnt++;
> +
> +       if (!bio_flagged(bio, BIO_WORKINGSET) && unlikely(PageWorkingset(page)))
> +               bio_set_flag(bio, BIO_WORKINGSET);
>  }
>  EXPORT_SYMBOL_GPL(__bio_add_page);
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index d0cc6e14d2f0..1b1705b7dde7 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -36,6 +36,7 @@
>  #include <linux/blk-cgroup.h>
>  #include <linux/debugfs.h>
>  #include <linux/bpf.h>
> +#include <linux/psi.h>
>
>  #define CREATE_TRACE_POINTS
>  #include <trace/events/block.h>
> @@ -1128,6 +1129,10 @@ EXPORT_SYMBOL_GPL(direct_make_request);
>   */
>  blk_qc_t submit_bio(struct bio *bio)
>  {
> +       bool workingset_read = false;
> +       unsigned long pflags;
> +       blk_qc_t ret;
> +
>         if (blkcg_punt_bio_submit(bio))
>                 return BLK_QC_T_NONE;
>
> @@ -1146,6 +1151,8 @@ blk_qc_t submit_bio(struct bio *bio)
>                 if (op_is_write(bio_op(bio))) {
>                         count_vm_events(PGPGOUT, count);
>                 } else {
> +                       if (bio_flagged(bio, BIO_WORKINGSET))
> +                               workingset_read = true;
>                         task_io_account_read(bio->bi_iter.bi_size);
>                         count_vm_events(PGPGIN, count);
>                 }
> @@ -1160,7 +1167,21 @@ blk_qc_t submit_bio(struct bio *bio)
>                 }
>         }
>
> -       return generic_make_request(bio);
> +       /*
> +        * If we're reading data that is part of the userspace
> +        * workingset, count submission time as memory stall. When the
> +        * device is congested, or the submitting cgroup IO-throttled,
> +        * submission can be a significant part of overall IO time.
> +        */
> +       if (workingset_read)
> +               psi_memstall_enter(&pflags);
> +
> +       ret = generic_make_request(bio);
> +
> +       if (workingset_read)
> +               psi_memstall_leave(&pflags);
> +
> +       return ret;
>  }
>  EXPORT_SYMBOL(submit_bio);
>
> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> index 1b1fa1557e68..a9dadfc16a92 100644
> --- a/include/linux/blk_types.h
> +++ b/include/linux/blk_types.h
> @@ -209,6 +209,7 @@ enum {
>         BIO_BOUNCED,            /* bio is a bounce bio */
>         BIO_USER_MAPPED,        /* contains user pages */
>         BIO_NULL_MAPPED,        /* contains invalid user pages */
> +       BIO_WORKINGSET,         /* contains userspace workingset pages */
>         BIO_QUIET,              /* Make BIO Quiet */
>         BIO_CHAIN,              /* chained bio, ->bi_remaining in effect */
>         BIO_REFFED,             /* bio has elevated ->bi_cnt */
> --
> 2.22.0
>

The change contributes to the amount of recorded stall while running
memory stress test with and without the patch. Did not notice any
performance regressions so far. Thanks!

Tested-by: Suren Baghdasaryan <surenb@google.com>

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH RESEND] block: annotate refault stalls from IO submission
  2019-08-09 22:12 ` Dave Chinner
@ 2019-08-13 17:46   ` Johannes Weiner
  2019-08-14  2:51     ` Dave Chinner
  0 siblings, 1 reply; 7+ messages in thread
From: Johannes Weiner @ 2019-08-13 17:46 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jens Axboe, Andrew Morton, linux-mm, linux-btrfs, linux-ext4,
	linux-fsdevel, linux-block, linux-kernel

On Sat, Aug 10, 2019 at 08:12:48AM +1000, Dave Chinner wrote:
> On Thu, Aug 08, 2019 at 03:03:00PM -0400, Johannes Weiner wrote:
> > psi tracks the time tasks wait for refaulting pages to become
> > uptodate, but it does not track the time spent submitting the IO. The
> > submission part can be significant if backing storage is contended or
> > when cgroup throttling (io.latency) is in effect - a lot of time is
> 
> Or the wbt is throttling.
> 
> > spent in submit_bio(). In that case, we underreport memory pressure.
> > 
> > Annotate submit_bio() to account submission time as memory stall when
> > the bio is reading userspace workingset pages.
> 
> PAtch looks fine to me, but it raises another question w.r.t. IO
> stalls and reclaim pressure feedback to the vm: how do we make use
> of the pressure stall infrastructure to track inode cache pressure
> and stalls?
> 
> With the congestion_wait() and wait_iff_congested() being entire
> non-functional for block devices since 5.0, there is no IO load
> based feedback going into memory reclaim from shrinkers that might
> require IO to free objects before they can be reclaimed. This is
> directly analogous to page reclaim writing back dirty pages from
> the LRU, and as I understand it one of things the PSI is supposed
> to be tracking.
>
> Lots of workloads create inode cache pressure and often it can
> dominate the time spent in memory reclaim, so it would seem to me
> that having PSI only track/calculate pressure and stalls from LRU
> pages misses a fair chunk of the memory pressure and reclaim stalls
> that can be occurring.

psi already tracks the entire reclaim operation. So if reclaim calls
into the shrinker and the shrinker scans inodes, initiates IO, or even
waits on IO, that time is accounted for as memory pressure stalling.

If you can think of asynchronous events that are initiated from
reclaim but cause indirect stalls in other contexts, contexts which
can clearly link the stall back to reclaim activity, we can annotate
them using psi_memstall_enter() / psi_memstall_leave().

In that vein, what would be great to have is be a distinction between
read stalls on dentries/inodes that have never been touched before
versus those that have been recently reclaimed - analogous to cold
page faults vs refaults.

It would help psi, sure, but more importantly it would help us better
balance pressure between filesystem metadata and the data pages. We
would be able to tell the difference between a `find /' and actual
thrashing, where hot inodes are getting kicked out and reloaded
repeatedly - and we could backfeed that pressure to the LRU pages to
allow the metadata caches to grow as needed.

For example, it could make sense to swap out a couple of completely
unused anonymous pages if it means we could hold the metadata
workingset fully in memory. But right now we cannot do that, because
we cannot risk swapping just because somebody runs find /.

I have semi-seriously talked to Josef about this before, but it wasn't
quite obvious where we could track non-residency or eviction
information for inodes, dentries etc. Maybe you have an idea?

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH RESEND] block: annotate refault stalls from IO submission
  2019-08-13 17:46   ` Johannes Weiner
@ 2019-08-14  2:51     ` Dave Chinner
  2019-08-14 13:53       ` Johannes Weiner
  0 siblings, 1 reply; 7+ messages in thread
From: Dave Chinner @ 2019-08-14  2:51 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Jens Axboe, Andrew Morton, linux-mm, linux-btrfs, linux-ext4,
	linux-fsdevel, linux-block, linux-kernel

On Tue, Aug 13, 2019 at 01:46:25PM -0400, Johannes Weiner wrote:
> On Sat, Aug 10, 2019 at 08:12:48AM +1000, Dave Chinner wrote:
> > On Thu, Aug 08, 2019 at 03:03:00PM -0400, Johannes Weiner wrote:
> > > psi tracks the time tasks wait for refaulting pages to become
> > > uptodate, but it does not track the time spent submitting the IO. The
> > > submission part can be significant if backing storage is contended or
> > > when cgroup throttling (io.latency) is in effect - a lot of time is
> > 
> > Or the wbt is throttling.
> > 
> > > spent in submit_bio(). In that case, we underreport memory pressure.
> > > 
> > > Annotate submit_bio() to account submission time as memory stall when
> > > the bio is reading userspace workingset pages.
> > 
> > PAtch looks fine to me, but it raises another question w.r.t. IO
> > stalls and reclaim pressure feedback to the vm: how do we make use
> > of the pressure stall infrastructure to track inode cache pressure
> > and stalls?
> > 
> > With the congestion_wait() and wait_iff_congested() being entire
> > non-functional for block devices since 5.0, there is no IO load
> > based feedback going into memory reclaim from shrinkers that might
> > require IO to free objects before they can be reclaimed. This is
> > directly analogous to page reclaim writing back dirty pages from
> > the LRU, and as I understand it one of things the PSI is supposed
> > to be tracking.
> >
> > Lots of workloads create inode cache pressure and often it can
> > dominate the time spent in memory reclaim, so it would seem to me
> > that having PSI only track/calculate pressure and stalls from LRU
> > pages misses a fair chunk of the memory pressure and reclaim stalls
> > that can be occurring.
> 
> psi already tracks the entire reclaim operation. So if reclaim calls
> into the shrinker and the shrinker scans inodes, initiates IO, or even
> waits on IO, that time is accounted for as memory pressure stalling.

hmmmm - reclaim _scanning_ is considered a stall event? i.e. even if
scanning does not block, it's still accounting that _time_ as a
memory pressure stall?

I'm probably missing it, but I don't see anything in vmpressure()
that actually accounts for time spent scanning.  AFAICT it accounts
for LRU objects scanned and reclaimed from memcgs, and then the
memory freed from the shrinkers is accounted only to the
sc->target_mem_cgroup once all memcgs have been iterated.

So, AFAICT, there's no time aspect to this, and the amount of
scanning that shrinkers do is not taken into account, so pressure
can't really be determined properly there. It seems like what the
shrinkers reclaim will actually give an incorrect interpretation of
pressure right now...

> If you can think of asynchronous events that are initiated from
> reclaim but cause indirect stalls in other contexts, contexts which
> can clearly link the stall back to reclaim activity, we can annotate
> them using psi_memstall_enter() / psi_memstall_leave().

Well, I was more thinking that issuing/waiting on IOs is a stall
event, not scanning.

The IO-less inode reclaim stuff for XFS really needs the main
reclaim loop to back off under heavy IO load, but we cannot put the
entire metadata writeback path under psi_memstall_enter/leave()
because:

	a) it's not linked to any user context - it's a
	per-superblock kernel thread; and

	b) it's designed to always be stalled on IO when there is
	metadata writeback pressure. That pressure most often comes from
	running out of journal space rather than memory pressure, and
	really there is no way to distinguish between the two from
	the writeback context.

Hence I don't think the vmpressure mechanism does what the memory
reclaim scanning loops really need because they do not feed back a
clear picture of the load on the IO subsystem load into the reclaim
loops.....

> In that vein, what would be great to have is be a distinction between
> read stalls on dentries/inodes that have never been touched before
> versus those that have been recently reclaimed - analogous to cold
> page faults vs refaults.

See my "nonblocking inode reclaim for XFS" series. It adds new
measures of that the shrinkers feed back to the main reclaim loop.

One of those measures is the number of objects scanned. Shrinkers
already report the number of objects they free, but that it tossed
away and not used by the main reclaim loop.

As for cold faults vs refaults, we could only report that for
dentries - if the inode is still cached in memory, then the dentry
hits the inode cache (hot fault), otherwise it's got to fetch the
inode from disk (cold fault). There is no mechanisms for tracking
inodes that have been recently reclaimed - the VFS uses a hash for
tracking cached inodes and so there's nothign you can drop
exceptional entries into to track reclaim state.

That said, we /could/ do this with the XFS inode cache. It uses
radix trees to index the cache, not the VFS level inode hash. Hence
we could place exceptional entries into the tree on reclaim to do
working set tracking similar to the way the page cache is used to
track the working set of pages.

The other thing we could do here is similar to the dentry cache - we
often have inode metadata buffers in the buffer cache (we have a
multi-level cache heirarchy that spans most of the metadata in the
active working set in XFS) and so if we miss the inode cache we
might hit the inode buffer in the buffer cache (hot fault).  If we
miss the inode cache and have to do IO to read the inodes, then it's
a cold fault.

That might be misleading, however, because the inode buffers cache
32 physical inodes and so reading 32 sequential cold inodes would
give 1 cold fault and 31 hot faults, even though those 31 inodes
have never been referenced by the workload before and that's not
ideal.

To complicate matters further, we can thrash the buffer cache,
resulting in cached inodes that have no backing buffer in memory.
then we we go to write the inode, we have to read in the inode
buffer before we can write it. This may have nothing to do with
working set thrashing, too. e.g. we have an inode that has been
referenced over and over again by the workload for several hours,
then a relatime update occurs and the inode is dirtied. when
writeback occurs, the inode buffer is nowhere to be found because it
was cycled out of the buffer cache hours ago and hasn't been
referenced since. hence I think we're probably best to ignore the
underlying filesystem metadata cache for the purposes of measuring
and detecting inode cache working set thrashing...

> It would help psi, sure, but more importantly it would help us better
> balance pressure between filesystem metadata and the data pages. We
> would be able to tell the difference between a `find /' and actual
> thrashing, where hot inodes are getting kicked out and reloaded
> repeatedly - and we could backfeed that pressure to the LRU pages to
> allow the metadata caches to grow as needed.

Well, hot inodes getting kicked out and immmediate re-used is
something we largely already handle via caching inode buffers in the
buffer cache.  So inode cache misses on XFS likely happen a lot more
than is obvious as we only see inode cache thrashing when we have
misses the metadata cache, not the inode cache.

> For example, it could make sense to swap out a couple of completely
> unused anonymous pages if it means we could hold the metadata
> workingset fully in memory. But right now we cannot do that, because
> we cannot risk swapping just because somebody runs find /.

I have workloads that run find to produce slab cache memory
pressure. On 5.2, they cause the system to swap madly because
there's no file page cache to reclaim and so the only reclaim that
can be done is inodes/dentries and swapping anonymous pages.

And swap it does - if we don't throttle reclaim sufficiently to
allow IO to make progress, then direct relcaim ends up in priority
windup and I see lots of OOM kills on swap-in. I found quite a few
ways to end up in "reclaim doesn't throttle on IO sufficiently and
OOM kills" in the last 3-4 weeks...

> I have semi-seriously talked to Josef about this before, but it wasn't
> quite obvious where we could track non-residency or eviction
> information for inodes, dentries etc. Maybe you have an idea?

See above - I think only XFS could track working inodes because of
it's unique caching infrastructure. Dentries largely don't matter,
because dentry cache misses either hit or miss the inode cache and
that's the working set that largely matters in terms of detecting
IO-related thrashing...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH RESEND] block: annotate refault stalls from IO submission
  2019-08-14  2:51     ` Dave Chinner
@ 2019-08-14 13:53       ` Johannes Weiner
  0 siblings, 0 replies; 7+ messages in thread
From: Johannes Weiner @ 2019-08-14 13:53 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jens Axboe, Andrew Morton, linux-mm, linux-btrfs, linux-ext4,
	linux-fsdevel, linux-block, linux-kernel

On Wed, Aug 14, 2019 at 12:51:30PM +1000, Dave Chinner wrote:
> On Tue, Aug 13, 2019 at 01:46:25PM -0400, Johannes Weiner wrote:
> > On Sat, Aug 10, 2019 at 08:12:48AM +1000, Dave Chinner wrote:
> > > On Thu, Aug 08, 2019 at 03:03:00PM -0400, Johannes Weiner wrote:
> > > > psi tracks the time tasks wait for refaulting pages to become
> > > > uptodate, but it does not track the time spent submitting the IO. The
> > > > submission part can be significant if backing storage is contended or
> > > > when cgroup throttling (io.latency) is in effect - a lot of time is
> > > 
> > > Or the wbt is throttling.
> > > 
> > > > spent in submit_bio(). In that case, we underreport memory pressure.
> > > > 
> > > > Annotate submit_bio() to account submission time as memory stall when
> > > > the bio is reading userspace workingset pages.
> > > 
> > > PAtch looks fine to me, but it raises another question w.r.t. IO
> > > stalls and reclaim pressure feedback to the vm: how do we make use
> > > of the pressure stall infrastructure to track inode cache pressure
> > > and stalls?
> > > 
> > > With the congestion_wait() and wait_iff_congested() being entire
> > > non-functional for block devices since 5.0, there is no IO load
> > > based feedback going into memory reclaim from shrinkers that might
> > > require IO to free objects before they can be reclaimed. This is
> > > directly analogous to page reclaim writing back dirty pages from
> > > the LRU, and as I understand it one of things the PSI is supposed
> > > to be tracking.
> > >
> > > Lots of workloads create inode cache pressure and often it can
> > > dominate the time spent in memory reclaim, so it would seem to me
> > > that having PSI only track/calculate pressure and stalls from LRU
> > > pages misses a fair chunk of the memory pressure and reclaim stalls
> > > that can be occurring.
> > 
> > psi already tracks the entire reclaim operation. So if reclaim calls
> > into the shrinker and the shrinker scans inodes, initiates IO, or even
> > waits on IO, that time is accounted for as memory pressure stalling.
> 
> hmmmm - reclaim _scanning_ is considered a stall event? i.e. even if
> scanning does not block, it's still accounting that _time_ as a
> memory pressure stall?

Yes. Reclaim doesn't need to block, the entire operation itself is an
interruption of the workload that only happens due to a lack of RAM.

Of course, as long as kswapd is just picking up one-off cache, it does
not take a whole lot of time, and it will barely register as
pressure. But as memory demand mounts and we have to look harder for
unused pages, reclaim time can become significant, even without IO.

> I'm probably missing it, but I don't see anything in vmpressure()
> that actually accounts for time spent scanning.  AFAICT it accounts
> for LRU objects scanned and reclaimed from memcgs, and then the
> memory freed from the shrinkers is accounted only to the
> sc->target_mem_cgroup once all memcgs have been iterated.

vmpressure is an orthogonal feature that is based purely on reclaim
efficiency (reclaimed/scanned).

psi accounting begins when we first call into try_to_free_pages() and
friends. psi_memstall_enter() marks the task, and it's the scheduler
part of psi that aggregates task state time into pressure ratios.

> > If you can think of asynchronous events that are initiated from
> > reclaim but cause indirect stalls in other contexts, contexts which
> > can clearly link the stall back to reclaim activity, we can annotate
> > them using psi_memstall_enter() / psi_memstall_leave().
> 
> Well, I was more thinking that issuing/waiting on IOs is a stall
> event, not scanning.
> 
> The IO-less inode reclaim stuff for XFS really needs the main
> reclaim loop to back off under heavy IO load, but we cannot put the
> entire metadata writeback path under psi_memstall_enter/leave()
> because:
> 
> 	a) it's not linked to any user context - it's a
> 	per-superblock kernel thread; and
> 
> 	b) it's designed to always be stalled on IO when there is
> 	metadata writeback pressure. That pressure most often comes from
> 	running out of journal space rather than memory pressure, and
> 	really there is no way to distinguish between the two from
> 	the writeback context.
> 
> Hence I don't think the vmpressure mechanism does what the memory
> reclaim scanning loops really need because they do not feed back a
> clear picture of the load on the IO subsystem load into the reclaim
> loops.....

Memory pressure metrics really seem unrelated to this problem, and
that's not what vmpressure or psi try to solve in the first place.

When you say we need better IO pressure feedback / congestion
throttling in reclaim, I can believe it, even though it's not
something we necessarily observed in our fleet.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH RESEND] block: annotate refault stalls from IO submission
  2019-08-08 19:03 [PATCH RESEND] block: annotate refault stalls from IO submission Johannes Weiner
  2019-08-09 22:12 ` Dave Chinner
  2019-08-09 23:03 ` Suren Baghdasaryan
@ 2019-08-14 14:50 ` Jens Axboe
  2 siblings, 0 replies; 7+ messages in thread
From: Jens Axboe @ 2019-08-14 14:50 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Dave Chinner, Andrew Morton, linux-mm, linux-btrfs, linux-ext4,
	linux-fsdevel, linux-block, linux-kernel

On 8/8/19 1:03 PM, Johannes Weiner wrote:
> psi tracks the time tasks wait for refaulting pages to become
> uptodate, but it does not track the time spent submitting the IO. The
> submission part can be significant if backing storage is contended or
> when cgroup throttling (io.latency) is in effect - a lot of time is
> spent in submit_bio(). In that case, we underreport memory pressure.
> 
> Annotate submit_bio() to account submission time as memory stall when
> the bio is reading userspace workingset pages.

Applied for 5.4.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2019-08-14 14:50 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-08-08 19:03 [PATCH RESEND] block: annotate refault stalls from IO submission Johannes Weiner
2019-08-09 22:12 ` Dave Chinner
2019-08-13 17:46   ` Johannes Weiner
2019-08-14  2:51     ` Dave Chinner
2019-08-14 13:53       ` Johannes Weiner
2019-08-09 23:03 ` Suren Baghdasaryan
2019-08-14 14:50 ` Jens Axboe

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).