* [PATCH RFC tip/core/rcu] Add shrinker to shift to fast/inefficient GP mode @ 2020-05-07 0:42 Paul E. McKenney 2020-05-07 0:55 ` Andrew Morton [not found] ` <20200507093647.11932-1-hdanton@sina.com> 0 siblings, 2 replies; 20+ messages in thread From: Paul E. McKenney @ 2020-05-07 0:42 UTC (permalink / raw) To: rcu Cc: linux-kernel, kernel-team, mingo, jiangshanlai, dipankar, akpm, mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells, edumazet, fweisbec, oleg, joel, viro, hannes This commit adds a shrinker so as to inform RCU when memory is scarce. RCU responds by shifting into the same fast and inefficient mode that is used in the presence of excessive numbers of RCU callbacks. RCU remains in this state for one-tenth of a second, though this time window can be extended by another call to the shrinker. If it proves feasible, a later commit might add a function call directly indicating the end of the period of scarce memory. Suggested-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index b0fe32f..76d148d 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -2368,8 +2368,15 @@ static void force_qs_rnp(int (*f)(struct rcu_data *rdp)) struct rcu_data *rdp; struct rcu_node *rnp; - rcu_state.cbovld = rcu_state.cbovldnext; + // Load .oomovld before .oomovldend, pairing with .oomovld set. + rcu_state.cbovld = smp_load_acquire(&rcu_state.oomovld) || // ^^^ + rcu_state.cbovldnext; rcu_state.cbovldnext = false; + if (READ_ONCE(rcu_state.oomovld) && + time_after(jiffies, READ_ONCE(rcu_state.oomovldend))) { + WRITE_ONCE(rcu_state.oomovld, false); + pr_info("%s: Ending OOM-mode grace periods.\n", __func__); + } rcu_for_each_leaf_node(rnp) { cond_resched_tasks_rcu_qs(); mask = 0; @@ -2697,6 +2704,35 @@ static void check_cb_ovld(struct rcu_data *rdp) raw_spin_unlock_rcu_node(rnp); } +/* Return a rough count of the RCU callbacks outstanding. */ +static unsigned long rcu_oom_count(struct shrinker *unused1, + struct shrink_control *unused2) +{ + int cpu; + unsigned long ncbs = 0; + + for_each_possible_cpu(cpu) + ncbs += rcu_get_n_cbs_cpu(cpu); + return ncbs; +} + +/* Start up an interval of fast high-overhead grace periods. */ +static unsigned long rcu_oom_scan(struct shrinker *unused1, + struct shrink_control *unused2) +{ + pr_info("%s: Starting OOM-mode grace periods.\n", __func__); + WRITE_ONCE(rcu_state.oomovldend, jiffies + HZ / 10); + smp_store_release(&rcu_state.oomovld, true); // After .oomovldend + rcu_force_quiescent_state(); // Kick grace period + return 0; // We haven't actually reclaimed anything yet. +} + +static struct shrinker rcu_shrinker = { + .count_objects = rcu_oom_count, + .scan_objects = rcu_oom_scan, + .seeks = DEFAULT_SEEKS, +}; + /* Helper function for call_rcu() and friends. */ static void __call_rcu(struct rcu_head *head, rcu_callback_t func) @@ -4146,6 +4182,7 @@ void __init rcu_init(void) qovld_calc = DEFAULT_RCU_QOVLD_MULT * qhimark; else qovld_calc = qovld; + WARN_ON(register_shrinker(&rcu_shrinker)); } #include "tree_stall.h" diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h index 2d7fcb9..c4d8e96 100644 --- a/kernel/rcu/tree.h +++ b/kernel/rcu/tree.h @@ -326,6 +326,8 @@ struct rcu_state { int ncpus_snap; /* # CPUs seen last time. */ u8 cbovld; /* Callback overload now? */ u8 cbovldnext; /* ^ ^ next time? */ + u8 oomovld; /* OOM overload? */ + unsigned long oomovldend; /* OOM ovld end, jiffies. */ unsigned long jiffies_force_qs; /* Time at which to invoke */ /* force_quiescent_state(). */ ^ permalink raw reply related [flat|nested] 20+ messages in thread
* Re: [PATCH RFC tip/core/rcu] Add shrinker to shift to fast/inefficient GP mode 2020-05-07 0:42 [PATCH RFC tip/core/rcu] Add shrinker to shift to fast/inefficient GP mode Paul E. McKenney @ 2020-05-07 0:55 ` Andrew Morton 2020-05-07 2:45 ` Paul E. McKenney 2020-05-07 17:00 ` Johannes Weiner [not found] ` <20200507093647.11932-1-hdanton@sina.com> 1 sibling, 2 replies; 20+ messages in thread From: Andrew Morton @ 2020-05-07 0:55 UTC (permalink / raw) To: paulmck Cc: rcu, linux-kernel, kernel-team, mingo, jiangshanlai, dipankar, mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells, edumazet, fweisbec, oleg, joel, viro, hannes, Dave Chinner On Wed, 6 May 2020 17:42:40 -0700 "Paul E. McKenney" <paulmck@kernel.org> wrote: > This commit adds a shrinker so as to inform RCU when memory is scarce. > RCU responds by shifting into the same fast and inefficient mode that is > used in the presence of excessive numbers of RCU callbacks. RCU remains > in this state for one-tenth of a second, though this time window can be > extended by another call to the shrinker. > > If it proves feasible, a later commit might add a function call directly > indicating the end of the period of scarce memory. (Cc David Chinner, who often has opinions on shrinkers ;)) It's a bit abusive of the intent of the slab shrinkers, but I don't immediately see a problem with it. Always returning 0 from ->scan_objects might cause a problem in some situations(?). Perhaps we should have a formal "system getting low on memory, please do something" notification API. How significant is this? How much memory can RCU consume? > --- a/kernel/rcu/tree.c > +++ b/kernel/rcu/tree.c > @@ -2368,8 +2368,15 @@ static void force_qs_rnp(int (*f)(struct rcu_data *rdp)) > struct rcu_data *rdp; > struct rcu_node *rnp; > > - rcu_state.cbovld = rcu_state.cbovldnext; > + // Load .oomovld before .oomovldend, pairing with .oomovld set. > + rcu_state.cbovld = smp_load_acquire(&rcu_state.oomovld) || // ^^^ > + rcu_state.cbovldnext; > rcu_state.cbovldnext = false; > + if (READ_ONCE(rcu_state.oomovld) && > + time_after(jiffies, READ_ONCE(rcu_state.oomovldend))) { > + WRITE_ONCE(rcu_state.oomovld, false); > + pr_info("%s: Ending OOM-mode grace periods.\n", __func__); > + } > rcu_for_each_leaf_node(rnp) { > cond_resched_tasks_rcu_qs(); > mask = 0; > @@ -2697,6 +2704,35 @@ static void check_cb_ovld(struct rcu_data *rdp) > raw_spin_unlock_rcu_node(rnp); > } > > +/* Return a rough count of the RCU callbacks outstanding. */ > +static unsigned long rcu_oom_count(struct shrinker *unused1, > + struct shrink_control *unused2) > +{ > + int cpu; > + unsigned long ncbs = 0; > + > + for_each_possible_cpu(cpu) > + ncbs += rcu_get_n_cbs_cpu(cpu); > + return ncbs; > +} > + > +/* Start up an interval of fast high-overhead grace periods. */ > +static unsigned long rcu_oom_scan(struct shrinker *unused1, > + struct shrink_control *unused2) > +{ > + pr_info("%s: Starting OOM-mode grace periods.\n", __func__); > + WRITE_ONCE(rcu_state.oomovldend, jiffies + HZ / 10); > + smp_store_release(&rcu_state.oomovld, true); // After .oomovldend > + rcu_force_quiescent_state(); // Kick grace period > + return 0; // We haven't actually reclaimed anything yet. > +} > + > +static struct shrinker rcu_shrinker = { > + .count_objects = rcu_oom_count, > + .scan_objects = rcu_oom_scan, > + .seeks = DEFAULT_SEEKS, > +}; > + > /* Helper function for call_rcu() and friends. */ > static void > __call_rcu(struct rcu_head *head, rcu_callback_t func) > @@ -4146,6 +4182,7 @@ void __init rcu_init(void) > qovld_calc = DEFAULT_RCU_QOVLD_MULT * qhimark; > else > qovld_calc = qovld; > + WARN_ON(register_shrinker(&rcu_shrinker)); > } > > #include "tree_stall.h" > diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h > index 2d7fcb9..c4d8e96 100644 > --- a/kernel/rcu/tree.h > +++ b/kernel/rcu/tree.h > @@ -326,6 +326,8 @@ struct rcu_state { > int ncpus_snap; /* # CPUs seen last time. */ > u8 cbovld; /* Callback overload now? */ > u8 cbovldnext; /* ^ ^ next time? */ > + u8 oomovld; /* OOM overload? */ > + unsigned long oomovldend; /* OOM ovld end, jiffies. */ > > unsigned long jiffies_force_qs; /* Time at which to invoke */ > /* force_quiescent_state(). */ ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH RFC tip/core/rcu] Add shrinker to shift to fast/inefficient GP mode 2020-05-07 0:55 ` Andrew Morton @ 2020-05-07 2:45 ` Paul E. McKenney 2020-05-07 17:00 ` Johannes Weiner 1 sibling, 0 replies; 20+ messages in thread From: Paul E. McKenney @ 2020-05-07 2:45 UTC (permalink / raw) To: Andrew Morton Cc: rcu, linux-kernel, kernel-team, mingo, jiangshanlai, dipankar, mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells, edumazet, fweisbec, oleg, joel, viro, hannes, Dave Chinner On Wed, May 06, 2020 at 05:55:35PM -0700, Andrew Morton wrote: > On Wed, 6 May 2020 17:42:40 -0700 "Paul E. McKenney" <paulmck@kernel.org> wrote: > > > This commit adds a shrinker so as to inform RCU when memory is scarce. > > RCU responds by shifting into the same fast and inefficient mode that is > > used in the presence of excessive numbers of RCU callbacks. RCU remains > > in this state for one-tenth of a second, though this time window can be > > extended by another call to the shrinker. > > > > If it proves feasible, a later commit might add a function call directly > > indicating the end of the period of scarce memory. > > (Cc David Chinner, who often has opinions on shrinkers ;)) > > It's a bit abusive of the intent of the slab shrinkers, but I don't > immediately see a problem with it. Always returning 0 from > ->scan_objects might cause a problem in some situations(?). I could just divide the total number of callbacks by 16 or some such, if that would work better. > Perhaps we should have a formal "system getting low on memory, please > do something" notification API. That would be a very good thing to have! But from what I can see, the shrinker interface is currently the closest approximation to such an interface. > How significant is this? How much memory can RCU consume? This depends on the configuration and workload. By default, RCU starts getting concerned if any CPU exceeds 10,000 callbacks. It is not all -that- hard to cause RCU to have tens of millions of callbacks queued, though some would argue that workloads doing this are rather abusive. But at 1K per, this maps to 10GB of storage. But in more normal workloads, I would expect the amount of storage awaiting an RCU grace period to not even come close to a gigabyte. Thoughts? Thanx, Paul > > --- a/kernel/rcu/tree.c > > +++ b/kernel/rcu/tree.c > > @@ -2368,8 +2368,15 @@ static void force_qs_rnp(int (*f)(struct rcu_data *rdp)) > > struct rcu_data *rdp; > > struct rcu_node *rnp; > > > > - rcu_state.cbovld = rcu_state.cbovldnext; > > + // Load .oomovld before .oomovldend, pairing with .oomovld set. > > + rcu_state.cbovld = smp_load_acquire(&rcu_state.oomovld) || // ^^^ > > + rcu_state.cbovldnext; > > rcu_state.cbovldnext = false; > > + if (READ_ONCE(rcu_state.oomovld) && > > + time_after(jiffies, READ_ONCE(rcu_state.oomovldend))) { > > + WRITE_ONCE(rcu_state.oomovld, false); > > + pr_info("%s: Ending OOM-mode grace periods.\n", __func__); > > + } > > rcu_for_each_leaf_node(rnp) { > > cond_resched_tasks_rcu_qs(); > > mask = 0; > > @@ -2697,6 +2704,35 @@ static void check_cb_ovld(struct rcu_data *rdp) > > raw_spin_unlock_rcu_node(rnp); > > } > > > > +/* Return a rough count of the RCU callbacks outstanding. */ > > +static unsigned long rcu_oom_count(struct shrinker *unused1, > > + struct shrink_control *unused2) > > +{ > > + int cpu; > > + unsigned long ncbs = 0; > > + > > + for_each_possible_cpu(cpu) > > + ncbs += rcu_get_n_cbs_cpu(cpu); > > + return ncbs; > > +} > > + > > +/* Start up an interval of fast high-overhead grace periods. */ > > +static unsigned long rcu_oom_scan(struct shrinker *unused1, > > + struct shrink_control *unused2) > > +{ > > + pr_info("%s: Starting OOM-mode grace periods.\n", __func__); > > + WRITE_ONCE(rcu_state.oomovldend, jiffies + HZ / 10); > > + smp_store_release(&rcu_state.oomovld, true); // After .oomovldend > > + rcu_force_quiescent_state(); // Kick grace period > > + return 0; // We haven't actually reclaimed anything yet. > > +} > > + > > +static struct shrinker rcu_shrinker = { > > + .count_objects = rcu_oom_count, > > + .scan_objects = rcu_oom_scan, > > + .seeks = DEFAULT_SEEKS, > > +}; > > + > > /* Helper function for call_rcu() and friends. */ > > static void > > __call_rcu(struct rcu_head *head, rcu_callback_t func) > > @@ -4146,6 +4182,7 @@ void __init rcu_init(void) > > qovld_calc = DEFAULT_RCU_QOVLD_MULT * qhimark; > > else > > qovld_calc = qovld; > > + WARN_ON(register_shrinker(&rcu_shrinker)); > > } > > > > #include "tree_stall.h" > > diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h > > index 2d7fcb9..c4d8e96 100644 > > --- a/kernel/rcu/tree.h > > +++ b/kernel/rcu/tree.h > > @@ -326,6 +326,8 @@ struct rcu_state { > > int ncpus_snap; /* # CPUs seen last time. */ > > u8 cbovld; /* Callback overload now? */ > > u8 cbovldnext; /* ^ ^ next time? */ > > + u8 oomovld; /* OOM overload? */ > > + unsigned long oomovldend; /* OOM ovld end, jiffies. */ > > > > unsigned long jiffies_force_qs; /* Time at which to invoke */ > > /* force_quiescent_state(). */ ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH RFC tip/core/rcu] Add shrinker to shift to fast/inefficient GP mode 2020-05-07 0:55 ` Andrew Morton 2020-05-07 2:45 ` Paul E. McKenney @ 2020-05-07 17:00 ` Johannes Weiner 2020-05-07 17:09 ` Paul E. McKenney 1 sibling, 1 reply; 20+ messages in thread From: Johannes Weiner @ 2020-05-07 17:00 UTC (permalink / raw) To: Andrew Morton Cc: paulmck, rcu, linux-kernel, kernel-team, mingo, jiangshanlai, dipankar, mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells, edumazet, fweisbec, oleg, joel, viro, Dave Chinner On Wed, May 06, 2020 at 05:55:35PM -0700, Andrew Morton wrote: > On Wed, 6 May 2020 17:42:40 -0700 "Paul E. McKenney" <paulmck@kernel.org> wrote: > > > This commit adds a shrinker so as to inform RCU when memory is scarce. > > RCU responds by shifting into the same fast and inefficient mode that is > > used in the presence of excessive numbers of RCU callbacks. RCU remains > > in this state for one-tenth of a second, though this time window can be > > extended by another call to the shrinker. We may be able to use shrinkers here, but merely being invoked does not carry a reliable distress signal. Shrinkers get invoked whenever vmscan runs. It's a useful indicator for when to age an auxiliary LRU list - test references, clear and rotate or reclaim stale entries. The urgency, and what can and cannot be considered "stale", is encoded in the callback frequency and scan counts, and meant to be relative to the VM's own rate of aging: "I've tested X percent of mine for recent use, now you go and test the same share of your pool." It doesn't translate well to other interpretations of the callbacks, although people have tried. > > If it proves feasible, a later commit might add a function call directly > > indicating the end of the period of scarce memory. > > (Cc David Chinner, who often has opinions on shrinkers ;)) > > It's a bit abusive of the intent of the slab shrinkers, but I don't > immediately see a problem with it. Always returning 0 from > ->scan_objects might cause a problem in some situations(?). > > Perhaps we should have a formal "system getting low on memory, please > do something" notification API. It's tricky to find a useful definition of what low on memory means. In the past we've used sc->priority cutoffs, the vmpressure interface (reclaimed/scanned - reclaim efficiency cutoffs), oom notifiers (another reclaim efficiency cutoff). But none of these reliably capture "distress", and they vary highly between different hardware setups. It can be hard to trigger OOM itself on fast IO devices, even when the machine is way past useful (where useful is somewhat subjective to the user). Userspace OOM implementations that consider userspace health (also subjective) are getting more common. > How significant is this? How much memory can RCU consume? I think if rcu can end up consuming a significant share of memory, one way that may work would be to do proper shrinker integration and track the age of its objects relative to the age of other allocations in the system. I.e. toss them all on a clock list with "new" bits and shrink them at VM velocity. If the shrinker sees objects with new bit set, clear and rotate. If it sees objects without them, we know rcu_heads outlive cache pages etc. and should probably cycle faster too. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH RFC tip/core/rcu] Add shrinker to shift to fast/inefficient GP mode 2020-05-07 17:00 ` Johannes Weiner @ 2020-05-07 17:09 ` Paul E. McKenney 2020-05-07 17:29 ` Paul E. McKenney 2020-05-07 18:31 ` Johannes Weiner 0 siblings, 2 replies; 20+ messages in thread From: Paul E. McKenney @ 2020-05-07 17:09 UTC (permalink / raw) To: Johannes Weiner Cc: Andrew Morton, rcu, linux-kernel, kernel-team, mingo, jiangshanlai, dipankar, mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells, edumazet, fweisbec, oleg, joel, viro, Dave Chinner On Thu, May 07, 2020 at 01:00:06PM -0400, Johannes Weiner wrote: > On Wed, May 06, 2020 at 05:55:35PM -0700, Andrew Morton wrote: > > On Wed, 6 May 2020 17:42:40 -0700 "Paul E. McKenney" <paulmck@kernel.org> wrote: > > > > > This commit adds a shrinker so as to inform RCU when memory is scarce. > > > RCU responds by shifting into the same fast and inefficient mode that is > > > used in the presence of excessive numbers of RCU callbacks. RCU remains > > > in this state for one-tenth of a second, though this time window can be > > > extended by another call to the shrinker. > > We may be able to use shrinkers here, but merely being invoked does > not carry a reliable distress signal. > > Shrinkers get invoked whenever vmscan runs. It's a useful indicator > for when to age an auxiliary LRU list - test references, clear and > rotate or reclaim stale entries. The urgency, and what can and cannot > be considered "stale", is encoded in the callback frequency and scan > counts, and meant to be relative to the VM's own rate of aging: "I've > tested X percent of mine for recent use, now you go and test the same > share of your pool." It doesn't translate well to other > interpretations of the callbacks, although people have tried. Would it make sense for RCU to interpret two invocations within (say) 100ms of each other as indicating urgency? (Hey, I had to ask!) > > > If it proves feasible, a later commit might add a function call directly > > > indicating the end of the period of scarce memory. > > > > (Cc David Chinner, who often has opinions on shrinkers ;)) > > > > It's a bit abusive of the intent of the slab shrinkers, but I don't > > immediately see a problem with it. Always returning 0 from > > ->scan_objects might cause a problem in some situations(?). > > > > Perhaps we should have a formal "system getting low on memory, please > > do something" notification API. > > It's tricky to find a useful definition of what low on memory > means. In the past we've used sc->priority cutoffs, the vmpressure > interface (reclaimed/scanned - reclaim efficiency cutoffs), oom > notifiers (another reclaim efficiency cutoff). But none of these > reliably capture "distress", and they vary highly between different > hardware setups. It can be hard to trigger OOM itself on fast IO > devices, even when the machine is way past useful (where useful is > somewhat subjective to the user). Userspace OOM implementations that > consider userspace health (also subjective) are getting more common. > > > How significant is this? How much memory can RCU consume? > > I think if rcu can end up consuming a significant share of memory, one > way that may work would be to do proper shrinker integration and track > the age of its objects relative to the age of other allocations in the > system. I.e. toss them all on a clock list with "new" bits and shrink > them at VM velocity. If the shrinker sees objects with new bit set, > clear and rotate. If it sees objects without them, we know rcu_heads > outlive cache pages etc. and should probably cycle faster too. It would be easy for RCU to pass back (or otherwise use) the age of the current grace period, if that would help. Tracking the age of individual callbacks is out of the question due to memory overhead, but RCU could approximate this via statistical sampling. Comparing this to grace-period durations could give information as to whether making grace periods go faster would be helpful. But, yes, it would be better to have an elusive unambiguous indication of distress. ;-) Thanx, Paul ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH RFC tip/core/rcu] Add shrinker to shift to fast/inefficient GP mode 2020-05-07 17:09 ` Paul E. McKenney @ 2020-05-07 17:29 ` Paul E. McKenney 2020-05-07 18:31 ` Johannes Weiner 1 sibling, 0 replies; 20+ messages in thread From: Paul E. McKenney @ 2020-05-07 17:29 UTC (permalink / raw) To: Johannes Weiner Cc: Andrew Morton, rcu, linux-kernel, kernel-team, mingo, jiangshanlai, dipankar, mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells, edumazet, fweisbec, oleg, joel, viro, Dave Chinner On Thu, May 07, 2020 at 10:09:03AM -0700, Paul E. McKenney wrote: > On Thu, May 07, 2020 at 01:00:06PM -0400, Johannes Weiner wrote: > > On Wed, May 06, 2020 at 05:55:35PM -0700, Andrew Morton wrote: > > > On Wed, 6 May 2020 17:42:40 -0700 "Paul E. McKenney" <paulmck@kernel.org> wrote: > > > > > > > This commit adds a shrinker so as to inform RCU when memory is scarce. > > > > RCU responds by shifting into the same fast and inefficient mode that is > > > > used in the presence of excessive numbers of RCU callbacks. RCU remains > > > > in this state for one-tenth of a second, though this time window can be > > > > extended by another call to the shrinker. > > > > We may be able to use shrinkers here, but merely being invoked does > > not carry a reliable distress signal. > > > > Shrinkers get invoked whenever vmscan runs. It's a useful indicator > > for when to age an auxiliary LRU list - test references, clear and > > rotate or reclaim stale entries. The urgency, and what can and cannot > > be considered "stale", is encoded in the callback frequency and scan > > counts, and meant to be relative to the VM's own rate of aging: "I've > > tested X percent of mine for recent use, now you go and test the same > > share of your pool." It doesn't translate well to other > > interpretations of the callbacks, although people have tried. > > Would it make sense for RCU to interpret two invocations within (say) > 100ms of each other as indicating urgency? (Hey, I had to ask!) > > > > > If it proves feasible, a later commit might add a function call directly > > > > indicating the end of the period of scarce memory. > > > > > > (Cc David Chinner, who often has opinions on shrinkers ;)) > > > > > > It's a bit abusive of the intent of the slab shrinkers, but I don't > > > immediately see a problem with it. Always returning 0 from > > > ->scan_objects might cause a problem in some situations(?). > > > > > > Perhaps we should have a formal "system getting low on memory, please > > > do something" notification API. > > > > It's tricky to find a useful definition of what low on memory > > means. In the past we've used sc->priority cutoffs, the vmpressure > > interface (reclaimed/scanned - reclaim efficiency cutoffs), oom > > notifiers (another reclaim efficiency cutoff). But none of these > > reliably capture "distress", and they vary highly between different > > hardware setups. It can be hard to trigger OOM itself on fast IO > > devices, even when the machine is way past useful (where useful is > > somewhat subjective to the user). Userspace OOM implementations that > > consider userspace health (also subjective) are getting more common. > > > > > How significant is this? How much memory can RCU consume? > > > > I think if rcu can end up consuming a significant share of memory, one > > way that may work would be to do proper shrinker integration and track > > the age of its objects relative to the age of other allocations in the > > system. I.e. toss them all on a clock list with "new" bits and shrink > > them at VM velocity. If the shrinker sees objects with new bit set, > > clear and rotate. If it sees objects without them, we know rcu_heads > > outlive cache pages etc. and should probably cycle faster too. > > It would be easy for RCU to pass back (or otherwise use) the age of the > current grace period, if that would help. > > Tracking the age of individual callbacks is out of the question due to > memory overhead, but RCU could approximate this via statistical sampling. > Comparing this to grace-period durations could give information as to > whether making grace periods go faster would be helpful. > > But, yes, it would be better to have an elusive unambiguous indication > of distress. ;-) And I have dropped this patch for the time being, but I do hope that it served a purpose in illustrating that it is not difficult to put RCU into a fast-but-inefficient mode when needed. Thanx, Paul ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH RFC tip/core/rcu] Add shrinker to shift to fast/inefficient GP mode 2020-05-07 17:09 ` Paul E. McKenney 2020-05-07 17:29 ` Paul E. McKenney @ 2020-05-07 18:31 ` Johannes Weiner 2020-05-07 19:09 ` Paul E. McKenney 1 sibling, 1 reply; 20+ messages in thread From: Johannes Weiner @ 2020-05-07 18:31 UTC (permalink / raw) To: Paul E. McKenney Cc: Andrew Morton, rcu, linux-kernel, kernel-team, mingo, jiangshanlai, dipankar, mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells, edumazet, fweisbec, oleg, joel, viro, Dave Chinner, Konstantin Khlebnikov On Thu, May 07, 2020 at 10:09:03AM -0700, Paul E. McKenney wrote: > On Thu, May 07, 2020 at 01:00:06PM -0400, Johannes Weiner wrote: > > On Wed, May 06, 2020 at 05:55:35PM -0700, Andrew Morton wrote: > > > On Wed, 6 May 2020 17:42:40 -0700 "Paul E. McKenney" <paulmck@kernel.org> wrote: > > > > > > > This commit adds a shrinker so as to inform RCU when memory is scarce. > > > > RCU responds by shifting into the same fast and inefficient mode that is > > > > used in the presence of excessive numbers of RCU callbacks. RCU remains > > > > in this state for one-tenth of a second, though this time window can be > > > > extended by another call to the shrinker. > > > > We may be able to use shrinkers here, but merely being invoked does > > not carry a reliable distress signal. > > > > Shrinkers get invoked whenever vmscan runs. It's a useful indicator > > for when to age an auxiliary LRU list - test references, clear and > > rotate or reclaim stale entries. The urgency, and what can and cannot > > be considered "stale", is encoded in the callback frequency and scan > > counts, and meant to be relative to the VM's own rate of aging: "I've > > tested X percent of mine for recent use, now you go and test the same > > share of your pool." It doesn't translate well to other > > interpretations of the callbacks, although people have tried. > > Would it make sense for RCU to interpret two invocations within (say) > 100ms of each other as indicating urgency? (Hey, I had to ask!) It's the perfect number for one combination of CPU, storage device, and shrinker implementation :-) > > > > If it proves feasible, a later commit might add a function call directly > > > > indicating the end of the period of scarce memory. > > > > > > (Cc David Chinner, who often has opinions on shrinkers ;)) > > > > > > It's a bit abusive of the intent of the slab shrinkers, but I don't > > > immediately see a problem with it. Always returning 0 from > > > ->scan_objects might cause a problem in some situations(?). > > > > > > Perhaps we should have a formal "system getting low on memory, please > > > do something" notification API. > > > > It's tricky to find a useful definition of what low on memory > > means. In the past we've used sc->priority cutoffs, the vmpressure > > interface (reclaimed/scanned - reclaim efficiency cutoffs), oom > > notifiers (another reclaim efficiency cutoff). But none of these > > reliably capture "distress", and they vary highly between different > > hardware setups. It can be hard to trigger OOM itself on fast IO > > devices, even when the machine is way past useful (where useful is > > somewhat subjective to the user). Userspace OOM implementations that > > consider userspace health (also subjective) are getting more common. > > > > > How significant is this? How much memory can RCU consume? > > > > I think if rcu can end up consuming a significant share of memory, one > > way that may work would be to do proper shrinker integration and track > > the age of its objects relative to the age of other allocations in the > > system. I.e. toss them all on a clock list with "new" bits and shrink > > them at VM velocity. If the shrinker sees objects with new bit set, > > clear and rotate. If it sees objects without them, we know rcu_heads > > outlive cache pages etc. and should probably cycle faster too. > > It would be easy for RCU to pass back (or otherwise use) the age of the > current grace period, if that would help. > > Tracking the age of individual callbacks is out of the question due to > memory overhead, but RCU could approximate this via statistical sampling. > Comparing this to grace-period durations could give information as to > whether making grace periods go faster would be helpful. That makes sense. So RCU knows the time and the VM knows the amount of memory. Either RCU needs to figure out its memory component to be able to translate shrinker input to age, or the VM needs to learn about time to be able to say: I'm currently scanning memory older than timestamp X. The latter would also require sampling in the VM. Nose goes. :-) There actually is prior art for teaching reclaim about time: https://lore.kernel.org/linux-mm/20130430110214.22179.26139.stgit@zurg/ CCing Konstantin. I'm curious how widely this ended up being used and how reliably it worked. > But, yes, it would be better to have an elusive unambiguous indication > of distress. ;-) I agree. Preferably something more practical than a dialogue box asking the user on how well things are going for them :-) ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH RFC tip/core/rcu] Add shrinker to shift to fast/inefficient GP mode 2020-05-07 18:31 ` Johannes Weiner @ 2020-05-07 19:09 ` Paul E. McKenney 2020-05-08 9:00 ` Konstantin Khlebnikov 0 siblings, 1 reply; 20+ messages in thread From: Paul E. McKenney @ 2020-05-07 19:09 UTC (permalink / raw) To: Johannes Weiner Cc: Andrew Morton, rcu, linux-kernel, kernel-team, mingo, jiangshanlai, dipankar, mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells, edumazet, fweisbec, oleg, joel, viro, Dave Chinner, Konstantin Khlebnikov On Thu, May 07, 2020 at 02:31:02PM -0400, Johannes Weiner wrote: > On Thu, May 07, 2020 at 10:09:03AM -0700, Paul E. McKenney wrote: > > On Thu, May 07, 2020 at 01:00:06PM -0400, Johannes Weiner wrote: > > > On Wed, May 06, 2020 at 05:55:35PM -0700, Andrew Morton wrote: > > > > On Wed, 6 May 2020 17:42:40 -0700 "Paul E. McKenney" <paulmck@kernel.org> wrote: > > > > > > > > > This commit adds a shrinker so as to inform RCU when memory is scarce. > > > > > RCU responds by shifting into the same fast and inefficient mode that is > > > > > used in the presence of excessive numbers of RCU callbacks. RCU remains > > > > > in this state for one-tenth of a second, though this time window can be > > > > > extended by another call to the shrinker. > > > > > > We may be able to use shrinkers here, but merely being invoked does > > > not carry a reliable distress signal. > > > > > > Shrinkers get invoked whenever vmscan runs. It's a useful indicator > > > for when to age an auxiliary LRU list - test references, clear and > > > rotate or reclaim stale entries. The urgency, and what can and cannot > > > be considered "stale", is encoded in the callback frequency and scan > > > counts, and meant to be relative to the VM's own rate of aging: "I've > > > tested X percent of mine for recent use, now you go and test the same > > > share of your pool." It doesn't translate well to other > > > interpretations of the callbacks, although people have tried. > > > > Would it make sense for RCU to interpret two invocations within (say) > > 100ms of each other as indicating urgency? (Hey, I had to ask!) > > It's the perfect number for one combination of CPU, storage device, > and shrinker implementation :-) Woo-hoo!!! But is that one combination actually in use anywhere? ;-) > > > > > If it proves feasible, a later commit might add a function call directly > > > > > indicating the end of the period of scarce memory. > > > > > > > > (Cc David Chinner, who often has opinions on shrinkers ;)) > > > > > > > > It's a bit abusive of the intent of the slab shrinkers, but I don't > > > > immediately see a problem with it. Always returning 0 from > > > > ->scan_objects might cause a problem in some situations(?). > > > > > > > > Perhaps we should have a formal "system getting low on memory, please > > > > do something" notification API. > > > > > > It's tricky to find a useful definition of what low on memory > > > means. In the past we've used sc->priority cutoffs, the vmpressure > > > interface (reclaimed/scanned - reclaim efficiency cutoffs), oom > > > notifiers (another reclaim efficiency cutoff). But none of these > > > reliably capture "distress", and they vary highly between different > > > hardware setups. It can be hard to trigger OOM itself on fast IO > > > devices, even when the machine is way past useful (where useful is > > > somewhat subjective to the user). Userspace OOM implementations that > > > consider userspace health (also subjective) are getting more common. > > > > > > > How significant is this? How much memory can RCU consume? > > > > > > I think if rcu can end up consuming a significant share of memory, one > > > way that may work would be to do proper shrinker integration and track > > > the age of its objects relative to the age of other allocations in the > > > system. I.e. toss them all on a clock list with "new" bits and shrink > > > them at VM velocity. If the shrinker sees objects with new bit set, > > > clear and rotate. If it sees objects without them, we know rcu_heads > > > outlive cache pages etc. and should probably cycle faster too. > > > > It would be easy for RCU to pass back (or otherwise use) the age of the > > current grace period, if that would help. > > > > Tracking the age of individual callbacks is out of the question due to > > memory overhead, but RCU could approximate this via statistical sampling. > > Comparing this to grace-period durations could give information as to > > whether making grace periods go faster would be helpful. > > That makes sense. > > So RCU knows the time and the VM knows the amount of memory. Either > RCU needs to figure out its memory component to be able to translate > shrinker input to age, or the VM needs to learn about time to be able > to say: I'm currently scanning memory older than timestamp X. > > The latter would also require sampling in the VM. Nose goes. :-) Sounds about right. ;-) Does reclaim have any notion of having continuously scanned for longer than some amount of time? Or could RCU reasonably deduce this? For example, if RCU noticed that reclaim had been scanning for longer than (say) five grace periods, RCU might decide to speed things up. But on the other hand, with slow disks, reclaim might go on for tens of seconds even without much in the way of memory pressure, mightn't it? I suppose that another indicator would be recent NULL returns from allocators. But that indicator flashes a bit later than one would like, doesn't it? And has false positives when allocators are invoked from atomic contexts, no doubt. And no doubt similar for sleeping more than a certain length of time in an allocator. > There actually is prior art for teaching reclaim about time: > https://lore.kernel.org/linux-mm/20130430110214.22179.26139.stgit@zurg/ > > CCing Konstantin. I'm curious how widely this ended up being used and > how reliably it worked. Looking forward to hearing of any results! > > But, yes, it would be better to have an elusive unambiguous indication > > of distress. ;-) > > I agree. Preferably something more practical than a dialogue box > asking the user on how well things are going for them :-) Indeed, that dialog box should be especially useful for things like light bulbs running Linux. ;-) Thanx, Paul ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH RFC tip/core/rcu] Add shrinker to shift to fast/inefficient GP mode 2020-05-07 19:09 ` Paul E. McKenney @ 2020-05-08 9:00 ` Konstantin Khlebnikov 2020-05-08 14:46 ` Paul E. McKenney 0 siblings, 1 reply; 20+ messages in thread From: Konstantin Khlebnikov @ 2020-05-08 9:00 UTC (permalink / raw) To: paulmck, Johannes Weiner Cc: Andrew Morton, rcu, linux-kernel, kernel-team, mingo, jiangshanlai, dipankar, mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells, edumazet, fweisbec, oleg, joel, viro, Dave Chinner On 07/05/2020 22.09, Paul E. McKenney wrote: > On Thu, May 07, 2020 at 02:31:02PM -0400, Johannes Weiner wrote: >> On Thu, May 07, 2020 at 10:09:03AM -0700, Paul E. McKenney wrote: >>> On Thu, May 07, 2020 at 01:00:06PM -0400, Johannes Weiner wrote: >>>> On Wed, May 06, 2020 at 05:55:35PM -0700, Andrew Morton wrote: >>>>> On Wed, 6 May 2020 17:42:40 -0700 "Paul E. McKenney" <paulmck@kernel.org> wrote: >>>>> >>>>>> This commit adds a shrinker so as to inform RCU when memory is scarce. >>>>>> RCU responds by shifting into the same fast and inefficient mode that is >>>>>> used in the presence of excessive numbers of RCU callbacks. RCU remains >>>>>> in this state for one-tenth of a second, though this time window can be >>>>>> extended by another call to the shrinker. >>>> >>>> We may be able to use shrinkers here, but merely being invoked does >>>> not carry a reliable distress signal. >>>> >>>> Shrinkers get invoked whenever vmscan runs. It's a useful indicator >>>> for when to age an auxiliary LRU list - test references, clear and >>>> rotate or reclaim stale entries. The urgency, and what can and cannot >>>> be considered "stale", is encoded in the callback frequency and scan >>>> counts, and meant to be relative to the VM's own rate of aging: "I've >>>> tested X percent of mine for recent use, now you go and test the same >>>> share of your pool." It doesn't translate well to other >>>> interpretations of the callbacks, although people have tried. >>> >>> Would it make sense for RCU to interpret two invocations within (say) >>> 100ms of each other as indicating urgency? (Hey, I had to ask!) >> >> It's the perfect number for one combination of CPU, storage device, >> and shrinker implementation :-) > > Woo-hoo!!! > > But is that one combination actually in use anywhere? ;-) > >>>>>> If it proves feasible, a later commit might add a function call directly >>>>>> indicating the end of the period of scarce memory. >>>>> >>>>> (Cc David Chinner, who often has opinions on shrinkers ;)) >>>>> >>>>> It's a bit abusive of the intent of the slab shrinkers, but I don't >>>>> immediately see a problem with it. Always returning 0 from >>>>> ->scan_objects might cause a problem in some situations(?). >>>>> >>>>> Perhaps we should have a formal "system getting low on memory, please >>>>> do something" notification API. >>>> >>>> It's tricky to find a useful definition of what low on memory >>>> means. In the past we've used sc->priority cutoffs, the vmpressure >>>> interface (reclaimed/scanned - reclaim efficiency cutoffs), oom >>>> notifiers (another reclaim efficiency cutoff). But none of these >>>> reliably capture "distress", and they vary highly between different >>>> hardware setups. It can be hard to trigger OOM itself on fast IO >>>> devices, even when the machine is way past useful (where useful is >>>> somewhat subjective to the user). Userspace OOM implementations that >>>> consider userspace health (also subjective) are getting more common. >>>> >>>>> How significant is this? How much memory can RCU consume? >>>> >>>> I think if rcu can end up consuming a significant share of memory, one >>>> way that may work would be to do proper shrinker integration and track >>>> the age of its objects relative to the age of other allocations in the >>>> system. I.e. toss them all on a clock list with "new" bits and shrink >>>> them at VM velocity. If the shrinker sees objects with new bit set, >>>> clear and rotate. If it sees objects without them, we know rcu_heads >>>> outlive cache pages etc. and should probably cycle faster too. >>> >>> It would be easy for RCU to pass back (or otherwise use) the age of the >>> current grace period, if that would help. >>> >>> Tracking the age of individual callbacks is out of the question due to >>> memory overhead, but RCU could approximate this via statistical sampling. >>> Comparing this to grace-period durations could give information as to >>> whether making grace periods go faster would be helpful. >> >> That makes sense. >> >> So RCU knows the time and the VM knows the amount of memory. Either >> RCU needs to figure out its memory component to be able to translate >> shrinker input to age, or the VM needs to learn about time to be able >> to say: I'm currently scanning memory older than timestamp X. >> >> The latter would also require sampling in the VM. Nose goes. :-) > > Sounds about right. ;-) > > Does reclaim have any notion of having continuously scanned for > longer than some amount of time? Or could RCU reasonably deduce this? > For example, if RCU noticed that reclaim had been scanning for longer than > (say) five grace periods, RCU might decide to speed things up. > > But on the other hand, with slow disks, reclaim might go on for tens of > seconds even without much in the way of memory pressure, mightn't it? > > I suppose that another indicator would be recent NULL returns from > allocators. But that indicator flashes a bit later than one would like, > doesn't it? And has false positives when allocators are invoked from > atomic contexts, no doubt. And no doubt similar for sleeping more than > a certain length of time in an allocator. > >> There actually is prior art for teaching reclaim about time: >> https://lore.kernel.org/linux-mm/20130430110214.22179.26139.stgit@zurg/ >> >> CCing Konstantin. I'm curious how widely this ended up being used and >> how reliably it worked. > > Looking forward to hearing of any results! Well, that was some experiment about automatic steering memory pressure between containers. LRU timings from milestones itself worked pretty well. Remaining engine were more robust than mainline cgroups these days. Memory becomes much cheaper - I hope nobody want's overcommit it that badly anymore. It seems modern MM has plenty signals about memory pressure. Kswapsd should have enough knowledge to switch gears in RCU. > >>> But, yes, it would be better to have an elusive unambiguous indication >>> of distress. ;-) >> >> I agree. Preferably something more practical than a dialogue box >> asking the user on how well things are going for them :-) > > Indeed, that dialog box should be especially useful for things like > light bulbs running Linux. ;-) > > Thanx, Paul > ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH RFC tip/core/rcu] Add shrinker to shift to fast/inefficient GP mode 2020-05-08 9:00 ` Konstantin Khlebnikov @ 2020-05-08 14:46 ` Paul E. McKenney 2020-05-09 8:54 ` Konstantin Khlebnikov 0 siblings, 1 reply; 20+ messages in thread From: Paul E. McKenney @ 2020-05-08 14:46 UTC (permalink / raw) To: Konstantin Khlebnikov Cc: Johannes Weiner, Andrew Morton, rcu, linux-kernel, kernel-team, mingo, jiangshanlai, dipankar, mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells, edumazet, fweisbec, oleg, joel, viro, Dave Chinner On Fri, May 08, 2020 at 12:00:28PM +0300, Konstantin Khlebnikov wrote: > On 07/05/2020 22.09, Paul E. McKenney wrote: > > On Thu, May 07, 2020 at 02:31:02PM -0400, Johannes Weiner wrote: > > > On Thu, May 07, 2020 at 10:09:03AM -0700, Paul E. McKenney wrote: > > > > On Thu, May 07, 2020 at 01:00:06PM -0400, Johannes Weiner wrote: > > > > > On Wed, May 06, 2020 at 05:55:35PM -0700, Andrew Morton wrote: > > > > > > On Wed, 6 May 2020 17:42:40 -0700 "Paul E. McKenney" <paulmck@kernel.org> wrote: > > > > > > > > > > > > > This commit adds a shrinker so as to inform RCU when memory is scarce. > > > > > > > RCU responds by shifting into the same fast and inefficient mode that is > > > > > > > used in the presence of excessive numbers of RCU callbacks. RCU remains > > > > > > > in this state for one-tenth of a second, though this time window can be > > > > > > > extended by another call to the shrinker. > > > > > > > > > > We may be able to use shrinkers here, but merely being invoked does > > > > > not carry a reliable distress signal. > > > > > > > > > > Shrinkers get invoked whenever vmscan runs. It's a useful indicator > > > > > for when to age an auxiliary LRU list - test references, clear and > > > > > rotate or reclaim stale entries. The urgency, and what can and cannot > > > > > be considered "stale", is encoded in the callback frequency and scan > > > > > counts, and meant to be relative to the VM's own rate of aging: "I've > > > > > tested X percent of mine for recent use, now you go and test the same > > > > > share of your pool." It doesn't translate well to other > > > > > interpretations of the callbacks, although people have tried. > > > > > > > > Would it make sense for RCU to interpret two invocations within (say) > > > > 100ms of each other as indicating urgency? (Hey, I had to ask!) > > > > > > It's the perfect number for one combination of CPU, storage device, > > > and shrinker implementation :-) > > > > Woo-hoo!!! > > > > But is that one combination actually in use anywhere? ;-) > > > > > > > > > If it proves feasible, a later commit might add a function call directly > > > > > > > indicating the end of the period of scarce memory. > > > > > > > > > > > > (Cc David Chinner, who often has opinions on shrinkers ;)) > > > > > > > > > > > > It's a bit abusive of the intent of the slab shrinkers, but I don't > > > > > > immediately see a problem with it. Always returning 0 from > > > > > > ->scan_objects might cause a problem in some situations(?). > > > > > > > > > > > > Perhaps we should have a formal "system getting low on memory, please > > > > > > do something" notification API. > > > > > > > > > > It's tricky to find a useful definition of what low on memory > > > > > means. In the past we've used sc->priority cutoffs, the vmpressure > > > > > interface (reclaimed/scanned - reclaim efficiency cutoffs), oom > > > > > notifiers (another reclaim efficiency cutoff). But none of these > > > > > reliably capture "distress", and they vary highly between different > > > > > hardware setups. It can be hard to trigger OOM itself on fast IO > > > > > devices, even when the machine is way past useful (where useful is > > > > > somewhat subjective to the user). Userspace OOM implementations that > > > > > consider userspace health (also subjective) are getting more common. > > > > > > > > > > > How significant is this? How much memory can RCU consume? > > > > > > > > > > I think if rcu can end up consuming a significant share of memory, one > > > > > way that may work would be to do proper shrinker integration and track > > > > > the age of its objects relative to the age of other allocations in the > > > > > system. I.e. toss them all on a clock list with "new" bits and shrink > > > > > them at VM velocity. If the shrinker sees objects with new bit set, > > > > > clear and rotate. If it sees objects without them, we know rcu_heads > > > > > outlive cache pages etc. and should probably cycle faster too. > > > > > > > > It would be easy for RCU to pass back (or otherwise use) the age of the > > > > current grace period, if that would help. > > > > > > > > Tracking the age of individual callbacks is out of the question due to > > > > memory overhead, but RCU could approximate this via statistical sampling. > > > > Comparing this to grace-period durations could give information as to > > > > whether making grace periods go faster would be helpful. > > > > > > That makes sense. > > > > > > So RCU knows the time and the VM knows the amount of memory. Either > > > RCU needs to figure out its memory component to be able to translate > > > shrinker input to age, or the VM needs to learn about time to be able > > > to say: I'm currently scanning memory older than timestamp X. > > > > > > The latter would also require sampling in the VM. Nose goes. :-) > > > > Sounds about right. ;-) > > > > Does reclaim have any notion of having continuously scanned for > > longer than some amount of time? Or could RCU reasonably deduce this? > > For example, if RCU noticed that reclaim had been scanning for longer than > > (say) five grace periods, RCU might decide to speed things up. > > > > But on the other hand, with slow disks, reclaim might go on for tens of > > seconds even without much in the way of memory pressure, mightn't it? > > > > I suppose that another indicator would be recent NULL returns from > > allocators. But that indicator flashes a bit later than one would like, > > doesn't it? And has false positives when allocators are invoked from > > atomic contexts, no doubt. And no doubt similar for sleeping more than > > a certain length of time in an allocator. > > > > > There actually is prior art for teaching reclaim about time: > > > https://lore.kernel.org/linux-mm/20130430110214.22179.26139.stgit@zurg/ > > > > > > CCing Konstantin. I'm curious how widely this ended up being used and > > > how reliably it worked. > > > > Looking forward to hearing of any results! > > Well, that was some experiment about automatic steering memory pressure > between containers. LRU timings from milestones itself worked pretty well. > Remaining engine were more robust than mainline cgroups these days. > Memory becomes much cheaper - I hope nobody want's overcommit it that badly anymore. > > It seems modern MM has plenty signals about memory pressure. > Kswapsd should have enough knowledge to switch gears in RCU. Easy for me to provide "start fast and inefficient mode" and "stop fast and inefficient mode" APIs for MM to call! How about rcu_mempressure_start() and rcu_mempressure_end()? I would expect them not to nest (as in if you need them to nest, please let me know). I would not expect these to be invoked all that often (as in if you do need them to be fast and scalable, please let me know). RCU would then be in fast/inefficient mode if either MM told it to be or if RCU had detected callback overload on at least one CPU. Seem reasonable? Thanx, Paul > > > > But, yes, it would be better to have an elusive unambiguous indication > > > > of distress. ;-) > > > > > > I agree. Preferably something more practical than a dialogue box > > > asking the user on how well things are going for them :-) > > > > Indeed, that dialog box should be especially useful for things like > > light bulbs running Linux. ;-) > > > > Thanx, Paul > > ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH RFC tip/core/rcu] Add shrinker to shift to fast/inefficient GP mode 2020-05-08 14:46 ` Paul E. McKenney @ 2020-05-09 8:54 ` Konstantin Khlebnikov 2020-05-09 16:09 ` Paul E. McKenney 0 siblings, 1 reply; 20+ messages in thread From: Konstantin Khlebnikov @ 2020-05-09 8:54 UTC (permalink / raw) To: paulmck Cc: Johannes Weiner, Andrew Morton, rcu, linux-kernel, kernel-team, mingo, jiangshanlai, dipankar, mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells, edumazet, fweisbec, oleg, joel, viro, Dave Chinner On 08/05/2020 17.46, Paul E. McKenney wrote: > On Fri, May 08, 2020 at 12:00:28PM +0300, Konstantin Khlebnikov wrote: >> On 07/05/2020 22.09, Paul E. McKenney wrote: >>> On Thu, May 07, 2020 at 02:31:02PM -0400, Johannes Weiner wrote: >>>> On Thu, May 07, 2020 at 10:09:03AM -0700, Paul E. McKenney wrote: >>>>> On Thu, May 07, 2020 at 01:00:06PM -0400, Johannes Weiner wrote: >>>>>> On Wed, May 06, 2020 at 05:55:35PM -0700, Andrew Morton wrote: >>>>>>> On Wed, 6 May 2020 17:42:40 -0700 "Paul E. McKenney" <paulmck@kernel.org> wrote: >>>>>>> >>>>>>>> This commit adds a shrinker so as to inform RCU when memory is scarce. >>>>>>>> RCU responds by shifting into the same fast and inefficient mode that is >>>>>>>> used in the presence of excessive numbers of RCU callbacks. RCU remains >>>>>>>> in this state for one-tenth of a second, though this time window can be >>>>>>>> extended by another call to the shrinker. >>>>>> >>>>>> We may be able to use shrinkers here, but merely being invoked does >>>>>> not carry a reliable distress signal. >>>>>> >>>>>> Shrinkers get invoked whenever vmscan runs. It's a useful indicator >>>>>> for when to age an auxiliary LRU list - test references, clear and >>>>>> rotate or reclaim stale entries. The urgency, and what can and cannot >>>>>> be considered "stale", is encoded in the callback frequency and scan >>>>>> counts, and meant to be relative to the VM's own rate of aging: "I've >>>>>> tested X percent of mine for recent use, now you go and test the same >>>>>> share of your pool." It doesn't translate well to other >>>>>> interpretations of the callbacks, although people have tried. >>>>> >>>>> Would it make sense for RCU to interpret two invocations within (say) >>>>> 100ms of each other as indicating urgency? (Hey, I had to ask!) >>>> >>>> It's the perfect number for one combination of CPU, storage device, >>>> and shrinker implementation :-) >>> >>> Woo-hoo!!! >>> >>> But is that one combination actually in use anywhere? ;-) >>> >>>>>>>> If it proves feasible, a later commit might add a function call directly >>>>>>>> indicating the end of the period of scarce memory. >>>>>>> >>>>>>> (Cc David Chinner, who often has opinions on shrinkers ;)) >>>>>>> >>>>>>> It's a bit abusive of the intent of the slab shrinkers, but I don't >>>>>>> immediately see a problem with it. Always returning 0 from >>>>>>> ->scan_objects might cause a problem in some situations(?). >>>>>>> >>>>>>> Perhaps we should have a formal "system getting low on memory, please >>>>>>> do something" notification API. >>>>>> >>>>>> It's tricky to find a useful definition of what low on memory >>>>>> means. In the past we've used sc->priority cutoffs, the vmpressure >>>>>> interface (reclaimed/scanned - reclaim efficiency cutoffs), oom >>>>>> notifiers (another reclaim efficiency cutoff). But none of these >>>>>> reliably capture "distress", and they vary highly between different >>>>>> hardware setups. It can be hard to trigger OOM itself on fast IO >>>>>> devices, even when the machine is way past useful (where useful is >>>>>> somewhat subjective to the user). Userspace OOM implementations that >>>>>> consider userspace health (also subjective) are getting more common. >>>>>> >>>>>>> How significant is this? How much memory can RCU consume? >>>>>> >>>>>> I think if rcu can end up consuming a significant share of memory, one >>>>>> way that may work would be to do proper shrinker integration and track >>>>>> the age of its objects relative to the age of other allocations in the >>>>>> system. I.e. toss them all on a clock list with "new" bits and shrink >>>>>> them at VM velocity. If the shrinker sees objects with new bit set, >>>>>> clear and rotate. If it sees objects without them, we know rcu_heads >>>>>> outlive cache pages etc. and should probably cycle faster too. >>>>> >>>>> It would be easy for RCU to pass back (or otherwise use) the age of the >>>>> current grace period, if that would help. >>>>> >>>>> Tracking the age of individual callbacks is out of the question due to >>>>> memory overhead, but RCU could approximate this via statistical sampling. >>>>> Comparing this to grace-period durations could give information as to >>>>> whether making grace periods go faster would be helpful. >>>> >>>> That makes sense. >>>> >>>> So RCU knows the time and the VM knows the amount of memory. Either >>>> RCU needs to figure out its memory component to be able to translate >>>> shrinker input to age, or the VM needs to learn about time to be able >>>> to say: I'm currently scanning memory older than timestamp X. >>>> >>>> The latter would also require sampling in the VM. Nose goes. :-) >>> >>> Sounds about right. ;-) >>> >>> Does reclaim have any notion of having continuously scanned for >>> longer than some amount of time? Or could RCU reasonably deduce this? >>> For example, if RCU noticed that reclaim had been scanning for longer than >>> (say) five grace periods, RCU might decide to speed things up. >>> >>> But on the other hand, with slow disks, reclaim might go on for tens of >>> seconds even without much in the way of memory pressure, mightn't it? >>> >>> I suppose that another indicator would be recent NULL returns from >>> allocators. But that indicator flashes a bit later than one would like, >>> doesn't it? And has false positives when allocators are invoked from >>> atomic contexts, no doubt. And no doubt similar for sleeping more than >>> a certain length of time in an allocator. >>> >>>> There actually is prior art for teaching reclaim about time: >>>> https://lore.kernel.org/linux-mm/20130430110214.22179.26139.stgit@zurg/ >>>> >>>> CCing Konstantin. I'm curious how widely this ended up being used and >>>> how reliably it worked. >>> >>> Looking forward to hearing of any results! >> >> Well, that was some experiment about automatic steering memory pressure >> between containers. LRU timings from milestones itself worked pretty well. >> Remaining engine were more robust than mainline cgroups these days. >> Memory becomes much cheaper - I hope nobody want's overcommit it that badly anymore. >> >> It seems modern MM has plenty signals about memory pressure. >> Kswapsd should have enough knowledge to switch gears in RCU. > > Easy for me to provide "start fast and inefficient mode" and "stop fast > and inefficient mode" APIs for MM to call! > > How about rcu_mempressure_start() and rcu_mempressure_end()? I would > expect them not to nest (as in if you need them to nest, please let > me know). I would not expect these to be invoked all that often (as in > if you do need them to be fast and scalable, please let me know). > > RCU would then be in fast/inefficient mode if either MM told it to be > or if RCU had detected callback overload on at least one CPU. > > Seem reasonable? Not exactly nested calls, but kswapd threads are per numa node. So, at some level nodes under pressure must be counted. Also forcing rcu calls only for cpus in one numa node might be useful. I wonder if direct-reclaim should at some stage simply wait for RCU QS. I.e. call rcu_barrier() or similar somewhere before invoking OOM. All GFP_NOFAIL users should allow direct-reclaim, thus this loop in page_alloc shouldn't block RCU and doesn't need special care. > > Thanx, Paul > >>>>> But, yes, it would be better to have an elusive unambiguous indication >>>>> of distress. ;-) >>>> >>>> I agree. Preferably something more practical than a dialogue box >>>> asking the user on how well things are going for them :-) >>> >>> Indeed, that dialog box should be especially useful for things like >>> light bulbs running Linux. ;-) >>> >>> Thanx, Paul >>> ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH RFC tip/core/rcu] Add shrinker to shift to fast/inefficient GP mode 2020-05-09 8:54 ` Konstantin Khlebnikov @ 2020-05-09 16:09 ` Paul E. McKenney 2020-05-13 1:32 ` Dave Chinner 0 siblings, 1 reply; 20+ messages in thread From: Paul E. McKenney @ 2020-05-09 16:09 UTC (permalink / raw) To: Konstantin Khlebnikov Cc: Johannes Weiner, Andrew Morton, rcu, linux-kernel, kernel-team, mingo, jiangshanlai, dipankar, mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells, edumazet, fweisbec, oleg, joel, viro, Dave Chinner On Sat, May 09, 2020 at 11:54:40AM +0300, Konstantin Khlebnikov wrote: > On 08/05/2020 17.46, Paul E. McKenney wrote: > > On Fri, May 08, 2020 at 12:00:28PM +0300, Konstantin Khlebnikov wrote: > > > On 07/05/2020 22.09, Paul E. McKenney wrote: > > > > On Thu, May 07, 2020 at 02:31:02PM -0400, Johannes Weiner wrote: > > > > > On Thu, May 07, 2020 at 10:09:03AM -0700, Paul E. McKenney wrote: > > > > > > On Thu, May 07, 2020 at 01:00:06PM -0400, Johannes Weiner wrote: > > > > > > > On Wed, May 06, 2020 at 05:55:35PM -0700, Andrew Morton wrote: > > > > > > > > On Wed, 6 May 2020 17:42:40 -0700 "Paul E. McKenney" <paulmck@kernel.org> wrote: > > > > > > > > > > > > > > > > > This commit adds a shrinker so as to inform RCU when memory is scarce. > > > > > > > > > RCU responds by shifting into the same fast and inefficient mode that is > > > > > > > > > used in the presence of excessive numbers of RCU callbacks. RCU remains > > > > > > > > > in this state for one-tenth of a second, though this time window can be > > > > > > > > > extended by another call to the shrinker. > > > > > > > > > > > > > > We may be able to use shrinkers here, but merely being invoked does > > > > > > > not carry a reliable distress signal. > > > > > > > > > > > > > > Shrinkers get invoked whenever vmscan runs. It's a useful indicator > > > > > > > for when to age an auxiliary LRU list - test references, clear and > > > > > > > rotate or reclaim stale entries. The urgency, and what can and cannot > > > > > > > be considered "stale", is encoded in the callback frequency and scan > > > > > > > counts, and meant to be relative to the VM's own rate of aging: "I've > > > > > > > tested X percent of mine for recent use, now you go and test the same > > > > > > > share of your pool." It doesn't translate well to other > > > > > > > interpretations of the callbacks, although people have tried. > > > > > > > > > > > > Would it make sense for RCU to interpret two invocations within (say) > > > > > > 100ms of each other as indicating urgency? (Hey, I had to ask!) > > > > > > > > > > It's the perfect number for one combination of CPU, storage device, > > > > > and shrinker implementation :-) > > > > > > > > Woo-hoo!!! > > > > > > > > But is that one combination actually in use anywhere? ;-) > > > > > > > > > > > > > If it proves feasible, a later commit might add a function call directly > > > > > > > > > indicating the end of the period of scarce memory. > > > > > > > > > > > > > > > > (Cc David Chinner, who often has opinions on shrinkers ;)) > > > > > > > > > > > > > > > > It's a bit abusive of the intent of the slab shrinkers, but I don't > > > > > > > > immediately see a problem with it. Always returning 0 from > > > > > > > > ->scan_objects might cause a problem in some situations(?). > > > > > > > > > > > > > > > > Perhaps we should have a formal "system getting low on memory, please > > > > > > > > do something" notification API. > > > > > > > > > > > > > > It's tricky to find a useful definition of what low on memory > > > > > > > means. In the past we've used sc->priority cutoffs, the vmpressure > > > > > > > interface (reclaimed/scanned - reclaim efficiency cutoffs), oom > > > > > > > notifiers (another reclaim efficiency cutoff). But none of these > > > > > > > reliably capture "distress", and they vary highly between different > > > > > > > hardware setups. It can be hard to trigger OOM itself on fast IO > > > > > > > devices, even when the machine is way past useful (where useful is > > > > > > > somewhat subjective to the user). Userspace OOM implementations that > > > > > > > consider userspace health (also subjective) are getting more common. > > > > > > > > > > > > > > > How significant is this? How much memory can RCU consume? > > > > > > > > > > > > > > I think if rcu can end up consuming a significant share of memory, one > > > > > > > way that may work would be to do proper shrinker integration and track > > > > > > > the age of its objects relative to the age of other allocations in the > > > > > > > system. I.e. toss them all on a clock list with "new" bits and shrink > > > > > > > them at VM velocity. If the shrinker sees objects with new bit set, > > > > > > > clear and rotate. If it sees objects without them, we know rcu_heads > > > > > > > outlive cache pages etc. and should probably cycle faster too. > > > > > > > > > > > > It would be easy for RCU to pass back (or otherwise use) the age of the > > > > > > current grace period, if that would help. > > > > > > > > > > > > Tracking the age of individual callbacks is out of the question due to > > > > > > memory overhead, but RCU could approximate this via statistical sampling. > > > > > > Comparing this to grace-period durations could give information as to > > > > > > whether making grace periods go faster would be helpful. > > > > > > > > > > That makes sense. > > > > > > > > > > So RCU knows the time and the VM knows the amount of memory. Either > > > > > RCU needs to figure out its memory component to be able to translate > > > > > shrinker input to age, or the VM needs to learn about time to be able > > > > > to say: I'm currently scanning memory older than timestamp X. > > > > > > > > > > The latter would also require sampling in the VM. Nose goes. :-) > > > > > > > > Sounds about right. ;-) > > > > > > > > Does reclaim have any notion of having continuously scanned for > > > > longer than some amount of time? Or could RCU reasonably deduce this? > > > > For example, if RCU noticed that reclaim had been scanning for longer than > > > > (say) five grace periods, RCU might decide to speed things up. > > > > > > > > But on the other hand, with slow disks, reclaim might go on for tens of > > > > seconds even without much in the way of memory pressure, mightn't it? > > > > > > > > I suppose that another indicator would be recent NULL returns from > > > > allocators. But that indicator flashes a bit later than one would like, > > > > doesn't it? And has false positives when allocators are invoked from > > > > atomic contexts, no doubt. And no doubt similar for sleeping more than > > > > a certain length of time in an allocator. > > > > > > > > > There actually is prior art for teaching reclaim about time: > > > > > https://lore.kernel.org/linux-mm/20130430110214.22179.26139.stgit@zurg/ > > > > > > > > > > CCing Konstantin. I'm curious how widely this ended up being used and > > > > > how reliably it worked. > > > > > > > > Looking forward to hearing of any results! > > > > > > Well, that was some experiment about automatic steering memory pressure > > > between containers. LRU timings from milestones itself worked pretty well. > > > Remaining engine were more robust than mainline cgroups these days. > > > Memory becomes much cheaper - I hope nobody want's overcommit it that badly anymore. > > > > > > It seems modern MM has plenty signals about memory pressure. > > > Kswapsd should have enough knowledge to switch gears in RCU. > > > > Easy for me to provide "start fast and inefficient mode" and "stop fast > > and inefficient mode" APIs for MM to call! > > > > How about rcu_mempressure_start() and rcu_mempressure_end()? I would > > expect them not to nest (as in if you need them to nest, please let > > me know). I would not expect these to be invoked all that often (as in > > if you do need them to be fast and scalable, please let me know). > > > RCU would then be in fast/inefficient mode if either MM told it to be > > or if RCU had detected callback overload on at least one CPU. > > > > Seem reasonable? > > Not exactly nested calls, but kswapd threads are per numa node. > So, at some level nodes under pressure must be counted. Easy enough, especially given that RCU already "counts" CPUs having excessive numbers of callbacks. But assuming that the transitions to/from OOM are rare, I would start by just counting them with a global counter. If the counter is non-zero, RCU is in fast and inefficient mode. > Also forcing rcu calls only for cpus in one numa node might be useful. Interesting. RCU currently evaluates a given CPU by comparing the number of callbacks against a fixed cutoff that can be set at boot using rcutree.qhimark, which defaults to 10,000. When this cutoff is exceeded, RCU becomes more aggressive about invoking callbacks on that CPU, for example, by sacrificing some degree of real-time response. I believe that this heuristic would also serve the OOM use case well. > I wonder if direct-reclaim should at some stage simply wait for RCU QS. > I.e. call rcu_barrier() or similar somewhere before invoking OOM. The rcu_oom_count() function in the patch starting this thread returns the total number of outstanding callbacks queued on all CPUs. So one approach would be to invoke this function, and if the return value was truly huge (taking size of memory and who knows that all else into account), do the rcu_barrier() to wait for RCU to clear its current backlog. On the NUMA point, it would be dead easy for me to supply a function that returned the number of callbacks on a given CPU, which would allow you to similarly evaluate a NUMA node, a cgroup, or whatever. > All GFP_NOFAIL users should allow direct-reclaim, thus this loop > in page_alloc shouldn't block RCU and doesn't need special care. I must defer to you guys on this. The main caution is the duration of direct reclaim. After all, if it is too long, the kfree_rcu() instance would have been better of just invoking synchronize_rcu(). Thanx, Paul > > > > > > But, yes, it would be better to have an elusive unambiguous indication > > > > > > of distress. ;-) > > > > > > > > > > I agree. Preferably something more practical than a dialogue box > > > > > asking the user on how well things are going for them :-) > > > > > > > > Indeed, that dialog box should be especially useful for things like > > > > light bulbs running Linux. ;-) > > > > > > > > Thanx, Paul > > > > ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH RFC tip/core/rcu] Add shrinker to shift to fast/inefficient GP mode 2020-05-09 16:09 ` Paul E. McKenney @ 2020-05-13 1:32 ` Dave Chinner 2020-05-13 3:18 ` Paul E. McKenney 0 siblings, 1 reply; 20+ messages in thread From: Dave Chinner @ 2020-05-13 1:32 UTC (permalink / raw) To: Paul E. McKenney Cc: Konstantin Khlebnikov, Johannes Weiner, Andrew Morton, rcu, linux-kernel, kernel-team, mingo, jiangshanlai, dipankar, mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells, edumazet, fweisbec, oleg, joel, viro On Sat, May 09, 2020 at 09:09:00AM -0700, Paul E. McKenney wrote: > On Sat, May 09, 2020 at 11:54:40AM +0300, Konstantin Khlebnikov wrote: > > On 08/05/2020 17.46, Paul E. McKenney wrote: > > > Easy for me to provide "start fast and inefficient mode" and "stop fast > > > and inefficient mode" APIs for MM to call! > > > > > > How about rcu_mempressure_start() and rcu_mempressure_end()? I would > > > expect them not to nest (as in if you need them to nest, please let > > > me know). I would not expect these to be invoked all that often (as in > > > if you do need them to be fast and scalable, please let me know). > > > > RCU would then be in fast/inefficient mode if either MM told it to be > > > or if RCU had detected callback overload on at least one CPU. > > > > > > Seem reasonable? > > > > Not exactly nested calls, but kswapd threads are per numa node. > > So, at some level nodes under pressure must be counted. > > Easy enough, especially given that RCU already "counts" CPUs having > excessive numbers of callbacks. But assuming that the transitions to/from > OOM are rare, I would start by just counting them with a global counter. > If the counter is non-zero, RCU is in fast and inefficient mode. > > > Also forcing rcu calls only for cpus in one numa node might be useful. > > Interesting. RCU currently evaluates a given CPU by comparing the > number of callbacks against a fixed cutoff that can be set at boot using > rcutree.qhimark, which defaults to 10,000. When this cutoff is exceeded, > RCU becomes more aggressive about invoking callbacks on that CPU, for > example, by sacrificing some degree of real-time response. I believe > that this heuristic would also serve the OOM use case well. So one of the things that I'm not sure people have connected here is that memory reclaim done by shrinkers is one of the things that drives huge numbers of call_rcu() callbacks to free memory via rcu. If we are reclaiming dentries and inodes, then we can be pushing thousands to hundreds of thousands of objects into kfree_rcu() and/or direct call_rcu() calls to free these objects in a single reclaim pass. Hence the trigger for RCU going into "excessive callback" mode might, in fact, be kswapd running a pass over the shrinkers. i.e. memory reclaim itself can be responsible for pushing RCU into this "OOM pressure" situation. So perhaps we've missed a trick here by not having the memory reclaim routines trigger RCU callbacks at the end of a priority scan. The shrinkers have queued the objects for freeing, but they haven't actually been freed yet and so things like slab pages haven't actually been returned to the free pool even though the shrinkers have said "freed this many objects"... i.e. perhaps the right solution here is a "rcu_run_callbacks()" function that memory reclaim calls before backing off and/or winding up reclaim priority. > > I wonder if direct-reclaim should at some stage simply wait for RCU QS. > > I.e. call rcu_barrier() or similar somewhere before invoking OOM. > > The rcu_oom_count() function in the patch starting this thread returns the > total number of outstanding callbacks queued on all CPUs. So one approach > would be to invoke this function, and if the return value was truly > huge (taking size of memory and who knows that all else into account), > do the rcu_barrier() to wait for RCU to clear its current backlog. The shrinker scan control structure has a node mask in it to indicate what node (and hence CPUs) it should be reclaiming from. This information comes from the main reclaim scan routine, so it would be trivial to feed straight into the RCU code to have it act on just the CPUs/node that we are reclaiming memory from... > On the NUMA point, it would be dead easy for me to supply a function > that returned the number of callbacks on a given CPU, which would allow > you to similarly evaluate a NUMA node, a cgroup, or whatever. I'd think it runs the other way around - we optimisitically call the RCU layer to do cleanup, and the RCU layer decides if there's enough queued callbacks on the cpus/node to run callbacks immediately. It would even be provided with the scan priority to indicate the level of desperation memory reclaim is under.... > > All GFP_NOFAIL users should allow direct-reclaim, thus this loop > > in page_alloc shouldn't block RCU and doesn't need special care. > > I must defer to you guys on this. The main caution is the duration of > direct reclaim. After all, if it is too long, the kfree_rcu() instance > would have been better of just invoking synchronize_rcu(). Individual callers of kfree_rcu() have no idea of the load on RCU, nor how long direct reclaim is taking. Calling synchronize_rcu() incorrectly has pretty major downsides to it, so nobody should be trying to expedite kfree_rcu() unless there is a good reason to do so (e.g. at unmount to ensure everything allocated by a filesystem has actually been freed). Hence I'd much prefer the decision to expedite callbacks is made by the RCU subsystem based on it's known callback load and some indication of how close memory reclaim is to declaring OOM... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH RFC tip/core/rcu] Add shrinker to shift to fast/inefficient GP mode 2020-05-13 1:32 ` Dave Chinner @ 2020-05-13 3:18 ` Paul E. McKenney 2020-05-13 4:35 ` Konstantin Khlebnikov 2020-05-13 5:07 ` Dave Chinner 0 siblings, 2 replies; 20+ messages in thread From: Paul E. McKenney @ 2020-05-13 3:18 UTC (permalink / raw) To: Dave Chinner Cc: Konstantin Khlebnikov, Johannes Weiner, Andrew Morton, rcu, linux-kernel, kernel-team, mingo, jiangshanlai, dipankar, mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells, edumazet, fweisbec, oleg, joel, viro On Wed, May 13, 2020 at 11:32:38AM +1000, Dave Chinner wrote: > On Sat, May 09, 2020 at 09:09:00AM -0700, Paul E. McKenney wrote: > > On Sat, May 09, 2020 at 11:54:40AM +0300, Konstantin Khlebnikov wrote: > > > On 08/05/2020 17.46, Paul E. McKenney wrote: > > > > Easy for me to provide "start fast and inefficient mode" and "stop fast > > > > and inefficient mode" APIs for MM to call! > > > > > > > > How about rcu_mempressure_start() and rcu_mempressure_end()? I would > > > > expect them not to nest (as in if you need them to nest, please let > > > > me know). I would not expect these to be invoked all that often (as in > > > > if you do need them to be fast and scalable, please let me know). > > > > > RCU would then be in fast/inefficient mode if either MM told it to be > > > > or if RCU had detected callback overload on at least one CPU. > > > > > > > > Seem reasonable? > > > > > > Not exactly nested calls, but kswapd threads are per numa node. > > > So, at some level nodes under pressure must be counted. > > > > Easy enough, especially given that RCU already "counts" CPUs having > > excessive numbers of callbacks. But assuming that the transitions to/from > > OOM are rare, I would start by just counting them with a global counter. > > If the counter is non-zero, RCU is in fast and inefficient mode. > > > > > Also forcing rcu calls only for cpus in one numa node might be useful. > > > > Interesting. RCU currently evaluates a given CPU by comparing the > > number of callbacks against a fixed cutoff that can be set at boot using > > rcutree.qhimark, which defaults to 10,000. When this cutoff is exceeded, > > RCU becomes more aggressive about invoking callbacks on that CPU, for > > example, by sacrificing some degree of real-time response. I believe > > that this heuristic would also serve the OOM use case well. > > So one of the things that I'm not sure people have connected here is > that memory reclaim done by shrinkers is one of the things that > drives huge numbers of call_rcu() callbacks to free memory via rcu. > If we are reclaiming dentries and inodes, then we can be pushing > thousands to hundreds of thousands of objects into kfree_rcu() > and/or direct call_rcu() calls to free these objects in a single > reclaim pass. Good point! > Hence the trigger for RCU going into "excessive callback" mode > might, in fact, be kswapd running a pass over the shrinkers. i.e. > memory reclaim itself can be responsible for pushing RCU into this "OOM > pressure" situation. > > So perhaps we've missed a trick here by not having the memory > reclaim routines trigger RCU callbacks at the end of a priority > scan. The shrinkers have queued the objects for freeing, but they > haven't actually been freed yet and so things like slab pages > haven't actually been returned to the free pool even though the > shrinkers have said "freed this many objects"... > > i.e. perhaps the right solution here is a "rcu_run_callbacks()" > function that memory reclaim calls before backing off and/or winding > up reclaim priority. It would not be hard to make something that put RCU into fast/inefficient mode for a couple of grace periods. I will also look into the possibility of speeding up callback invocation. It might also make sense to put RCU grace periods into fast mode while running the shrinkers that are freeing dentries and inodes. However, kbuild test robot reports ugly regressions when putting RCU into fast/inefficient mode to quickly and too often. As in 78.5% degradation on one of the benchmarks. > > > I wonder if direct-reclaim should at some stage simply wait for RCU QS. > > > I.e. call rcu_barrier() or similar somewhere before invoking OOM. > > > > The rcu_oom_count() function in the patch starting this thread returns the > > total number of outstanding callbacks queued on all CPUs. So one approach > > would be to invoke this function, and if the return value was truly > > huge (taking size of memory and who knows that all else into account), > > do the rcu_barrier() to wait for RCU to clear its current backlog. > > The shrinker scan control structure has a node mask in it to > indicate what node (and hence CPUs) it should be reclaiming from. > This information comes from the main reclaim scan routine, so it > would be trivial to feed straight into the RCU code to have it > act on just the CPUs/node that we are reclaiming memory from... For the callbacks, RCU can operate on CPUs, in theory anyway. The grace period itself, however, is inherently global. > > On the NUMA point, it would be dead easy for me to supply a function > > that returned the number of callbacks on a given CPU, which would allow > > you to similarly evaluate a NUMA node, a cgroup, or whatever. > > I'd think it runs the other way around - we optimisitically call the > RCU layer to do cleanup, and the RCU layer decides if there's enough > queued callbacks on the cpus/node to run callbacks immediately. It > would even be provided with the scan priority to indicate the level > of desperation memory reclaim is under.... Easy for RCU to count the number of callbacks. That said, it has no idea which callbacks are which. Perhaps kfree_rcu() could gather that information from the slab allocator, though. > > > All GFP_NOFAIL users should allow direct-reclaim, thus this loop > > > in page_alloc shouldn't block RCU and doesn't need special care. > > > > I must defer to you guys on this. The main caution is the duration of > > direct reclaim. After all, if it is too long, the kfree_rcu() instance > > would have been better of just invoking synchronize_rcu(). > > Individual callers of kfree_rcu() have no idea of the load on RCU, > nor how long direct reclaim is taking. Calling synchronize_rcu() > incorrectly has pretty major downsides to it, so nobody should be > trying to expedite kfree_rcu() unless there is a good reason to do > so (e.g. at unmount to ensure everything allocated by a filesystem > has actually been freed). Hence I'd much prefer the decision to > expedite callbacks is made by the RCU subsystem based on it's known > callback load and some indication of how close memory reclaim is to > declaring OOM... Sorry, I was unclear. There is a new single-argument kfree_rcu() under way that does not require an rcu_head in the structure being freed. However, in this case, kfree_rcu() might either allocate the memory that is needed to track the memory to be freed on the one hand or just invoke synchronize_rcu() on the other. So this decision would be taken inside kfree_rcu(), and not be visible to either core RCU or the caller of kfree_rcu(). This decision is made based on whether or not the allocator provides kfree_rcu() the memory needed. The tradeoff is what GFP flags are supplied. So the question kfree_rcu() has to answer is "Would it be better to give myself to reclaim as an additional task, or would it instead be better to just invoke synchronize_rcu() and then immediately free()?" I am probably still unclear, but hopefully at least one step in the right direction. Thanx, Paul ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH RFC tip/core/rcu] Add shrinker to shift to fast/inefficient GP mode 2020-05-13 3:18 ` Paul E. McKenney @ 2020-05-13 4:35 ` Konstantin Khlebnikov 2020-05-13 12:52 ` Paul E. McKenney 2020-05-13 5:07 ` Dave Chinner 1 sibling, 1 reply; 20+ messages in thread From: Konstantin Khlebnikov @ 2020-05-13 4:35 UTC (permalink / raw) To: paulmck, Dave Chinner Cc: Johannes Weiner, Andrew Morton, rcu, linux-kernel, kernel-team, mingo, jiangshanlai, dipankar, mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells, edumazet, fweisbec, oleg, joel, viro On 13/05/2020 06.18, Paul E. McKenney wrote: > On Wed, May 13, 2020 at 11:32:38AM +1000, Dave Chinner wrote: >> On Sat, May 09, 2020 at 09:09:00AM -0700, Paul E. McKenney wrote: >>> On Sat, May 09, 2020 at 11:54:40AM +0300, Konstantin Khlebnikov wrote: >>>> On 08/05/2020 17.46, Paul E. McKenney wrote: >>>>> Easy for me to provide "start fast and inefficient mode" and "stop fast >>>>> and inefficient mode" APIs for MM to call! >>>>> >>>>> How about rcu_mempressure_start() and rcu_mempressure_end()? I would >>>>> expect them not to nest (as in if you need them to nest, please let >>>>> me know). I would not expect these to be invoked all that often (as in >>>>> if you do need them to be fast and scalable, please let me know). > >>>>> RCU would then be in fast/inefficient mode if either MM told it to be >>>>> or if RCU had detected callback overload on at least one CPU. >>>>> >>>>> Seem reasonable? >>>> >>>> Not exactly nested calls, but kswapd threads are per numa node. >>>> So, at some level nodes under pressure must be counted. >>> >>> Easy enough, especially given that RCU already "counts" CPUs having >>> excessive numbers of callbacks. But assuming that the transitions to/from >>> OOM are rare, I would start by just counting them with a global counter. >>> If the counter is non-zero, RCU is in fast and inefficient mode. >>> >>>> Also forcing rcu calls only for cpus in one numa node might be useful. >>> >>> Interesting. RCU currently evaluates a given CPU by comparing the >>> number of callbacks against a fixed cutoff that can be set at boot using >>> rcutree.qhimark, which defaults to 10,000. When this cutoff is exceeded, >>> RCU becomes more aggressive about invoking callbacks on that CPU, for >>> example, by sacrificing some degree of real-time response. I believe >>> that this heuristic would also serve the OOM use case well. >> >> So one of the things that I'm not sure people have connected here is >> that memory reclaim done by shrinkers is one of the things that >> drives huge numbers of call_rcu() callbacks to free memory via rcu. >> If we are reclaiming dentries and inodes, then we can be pushing >> thousands to hundreds of thousands of objects into kfree_rcu() >> and/or direct call_rcu() calls to free these objects in a single >> reclaim pass. > > Good point! Indeed > >> Hence the trigger for RCU going into "excessive callback" mode >> might, in fact, be kswapd running a pass over the shrinkers. i.e. >> memory reclaim itself can be responsible for pushing RCU into this "OOM >> pressure" situation. >> >> So perhaps we've missed a trick here by not having the memory >> reclaim routines trigger RCU callbacks at the end of a priority >> scan. The shrinkers have queued the objects for freeing, but they >> haven't actually been freed yet and so things like slab pages >> haven't actually been returned to the free pool even though the >> shrinkers have said "freed this many objects"... >> >> i.e. perhaps the right solution here is a "rcu_run_callbacks()" >> function that memory reclaim calls before backing off and/or winding >> up reclaim priority. > > It would not be hard to make something that put RCU into fast/inefficient > mode for a couple of grace periods. I will also look into the possibility > of speeding up callback invocation. > > It might also make sense to put RCU grace periods into fast mode while > running the shrinkers that are freeing dentries and inodes. However, > kbuild test robot reports ugly regressions when putting RCU into > fast/inefficient mode to quickly and too often. As in 78.5% degradation > on one of the benchmarks. I think fast/inefficient mode here just an optimization for freeing memory faster. It doesn't solve the problem itself. At first we have to close the loop in reclaimer and actually wait or run rcu callbacks which might free memory before increasing priority and invoking OOM killer. > >>>> I wonder if direct-reclaim should at some stage simply wait for RCU QS. >>>> I.e. call rcu_barrier() or similar somewhere before invoking OOM. >>> >>> The rcu_oom_count() function in the patch starting this thread returns the >>> total number of outstanding callbacks queued on all CPUs. So one approach >>> would be to invoke this function, and if the return value was truly >>> huge (taking size of memory and who knows that all else into account), >>> do the rcu_barrier() to wait for RCU to clear its current backlog. >> >> The shrinker scan control structure has a node mask in it to >> indicate what node (and hence CPUs) it should be reclaiming from. >> This information comes from the main reclaim scan routine, so it >> would be trivial to feed straight into the RCU code to have it >> act on just the CPUs/node that we are reclaiming memory from... > > For the callbacks, RCU can operate on CPUs, in theory anyway. The > grace period itself, however, is inherently global. > >>> On the NUMA point, it would be dead easy for me to supply a function >>> that returned the number of callbacks on a given CPU, which would allow >>> you to similarly evaluate a NUMA node, a cgroup, or whatever. >> >> I'd think it runs the other way around - we optimisitically call the >> RCU layer to do cleanup, and the RCU layer decides if there's enough >> queued callbacks on the cpus/node to run callbacks immediately. It >> would even be provided with the scan priority to indicate the level >> of desperation memory reclaim is under.... > > Easy for RCU to count the number of callbacks. That said, it has no > idea which callbacks are which. Perhaps kfree_rcu() could gather that > information from the slab allocator, though. It's simple to mark slab shrinkers that frees object through RCU and count freed objects in reclaimer: --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -536,6 +536,9 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl, else new_nr = atomic_long_read(&shrinker->nr_deferred[nid]); + if (shrinker->flags & SHRINKER_KFREE_RCU) + shrinkctl->nr_kfree_rcu += freed; + trace_mm_shrink_slab_end(shrinker, nid, freed, nr, new_nr, total_scan); return freed; } And when accumulated enough do some synchronization. Probably it's better to sum freed objects at per-cpu variable, and accumulate size rather than count. > >>>> All GFP_NOFAIL users should allow direct-reclaim, thus this loop >>>> in page_alloc shouldn't block RCU and doesn't need special care. >>> >>> I must defer to you guys on this. The main caution is the duration of >>> direct reclaim. After all, if it is too long, the kfree_rcu() instance >>> would have been better of just invoking synchronize_rcu(). >> >> Individual callers of kfree_rcu() have no idea of the load on RCU, >> nor how long direct reclaim is taking. Calling synchronize_rcu() >> incorrectly has pretty major downsides to it, so nobody should be >> trying to expedite kfree_rcu() unless there is a good reason to do >> so (e.g. at unmount to ensure everything allocated by a filesystem >> has actually been freed). Hence I'd much prefer the decision to >> expedite callbacks is made by the RCU subsystem based on it's known >> callback load and some indication of how close memory reclaim is to >> declaring OOM... > > Sorry, I was unclear. There is a new single-argument kfree_rcu() under > way that does not require an rcu_head in the structure being freed. > However, in this case, kfree_rcu() might either allocate the memory > that is needed to track the memory to be freed on the one hand or just > invoke synchronize_rcu() on the other. So this decision would be taken > inside kfree_rcu(), and not be visible to either core RCU or the caller > of kfree_rcu(). > > This decision is made based on whether or not the allocator provides > kfree_rcu() the memory needed. The tradeoff is what GFP flags are > supplied. So the question kfree_rcu() has to answer is "Would it be > better to give myself to reclaim as an additional task, or would it > instead be better to just invoke synchronize_rcu() and then immediately > free()?" > > I am probably still unclear, but hopefully at least one step in the > right direction. > > Thanx, Paul > ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH RFC tip/core/rcu] Add shrinker to shift to fast/inefficient GP mode 2020-05-13 4:35 ` Konstantin Khlebnikov @ 2020-05-13 12:52 ` Paul E. McKenney 0 siblings, 0 replies; 20+ messages in thread From: Paul E. McKenney @ 2020-05-13 12:52 UTC (permalink / raw) To: Konstantin Khlebnikov Cc: Dave Chinner, Johannes Weiner, Andrew Morton, rcu, linux-kernel, kernel-team, mingo, jiangshanlai, dipankar, mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells, edumazet, fweisbec, oleg, joel, viro On Wed, May 13, 2020 at 07:35:25AM +0300, Konstantin Khlebnikov wrote: > On 13/05/2020 06.18, Paul E. McKenney wrote: > > On Wed, May 13, 2020 at 11:32:38AM +1000, Dave Chinner wrote: > > > On Sat, May 09, 2020 at 09:09:00AM -0700, Paul E. McKenney wrote: > > > > On Sat, May 09, 2020 at 11:54:40AM +0300, Konstantin Khlebnikov wrote: > > > > > On 08/05/2020 17.46, Paul E. McKenney wrote: > > > > > > Easy for me to provide "start fast and inefficient mode" and "stop fast > > > > > > and inefficient mode" APIs for MM to call! > > > > > > > > > > > > How about rcu_mempressure_start() and rcu_mempressure_end()? I would > > > > > > expect them not to nest (as in if you need them to nest, please let > > > > > > me know). I would not expect these to be invoked all that often (as in > > > > > > if you do need them to be fast and scalable, please let me know). > > > > > > > RCU would then be in fast/inefficient mode if either MM told it to be > > > > > > or if RCU had detected callback overload on at least one CPU. > > > > > > > > > > > > Seem reasonable? > > > > > > > > > > Not exactly nested calls, but kswapd threads are per numa node. > > > > > So, at some level nodes under pressure must be counted. > > > > > > > > Easy enough, especially given that RCU already "counts" CPUs having > > > > excessive numbers of callbacks. But assuming that the transitions to/from > > > > OOM are rare, I would start by just counting them with a global counter. > > > > If the counter is non-zero, RCU is in fast and inefficient mode. > > > > > > > > > Also forcing rcu calls only for cpus in one numa node might be useful. > > > > > > > > Interesting. RCU currently evaluates a given CPU by comparing the > > > > number of callbacks against a fixed cutoff that can be set at boot using > > > > rcutree.qhimark, which defaults to 10,000. When this cutoff is exceeded, > > > > RCU becomes more aggressive about invoking callbacks on that CPU, for > > > > example, by sacrificing some degree of real-time response. I believe > > > > that this heuristic would also serve the OOM use case well. > > > > > > So one of the things that I'm not sure people have connected here is > > > that memory reclaim done by shrinkers is one of the things that > > > drives huge numbers of call_rcu() callbacks to free memory via rcu. > > > If we are reclaiming dentries and inodes, then we can be pushing > > > thousands to hundreds of thousands of objects into kfree_rcu() > > > and/or direct call_rcu() calls to free these objects in a single > > > reclaim pass. > > > > Good point! > > Indeed > > > > > > Hence the trigger for RCU going into "excessive callback" mode > > > might, in fact, be kswapd running a pass over the shrinkers. i.e. > > > memory reclaim itself can be responsible for pushing RCU into this "OOM > > > pressure" situation. > > > > > > So perhaps we've missed a trick here by not having the memory > > > reclaim routines trigger RCU callbacks at the end of a priority > > > scan. The shrinkers have queued the objects for freeing, but they > > > haven't actually been freed yet and so things like slab pages > > > haven't actually been returned to the free pool even though the > > > shrinkers have said "freed this many objects"... > > > > > > i.e. perhaps the right solution here is a "rcu_run_callbacks()" > > > function that memory reclaim calls before backing off and/or winding > > > up reclaim priority. > > > > It would not be hard to make something that put RCU into fast/inefficient > > mode for a couple of grace periods. I will also look into the possibility > > of speeding up callback invocation. > > > > It might also make sense to put RCU grace periods into fast mode while > > running the shrinkers that are freeing dentries and inodes. However, > > kbuild test robot reports ugly regressions when putting RCU into > > fast/inefficient mode to quickly and too often. As in 78.5% degradation > > on one of the benchmarks. > > I think fast/inefficient mode here just an optimization for freeing > memory faster. It doesn't solve the problem itself. > > At first we have to close the loop in reclaimer and actually wait or run > rcu callbacks which might free memory before increasing priority and > invoking OOM killer. That is easy, just invoke rcu_barrier(), which will block until all prior call_rcu()/kfree_rcu() callbacks have been invoked. > > > > > I wonder if direct-reclaim should at some stage simply wait for RCU QS. > > > > > I.e. call rcu_barrier() or similar somewhere before invoking OOM. > > > > > > > > The rcu_oom_count() function in the patch starting this thread returns the > > > > total number of outstanding callbacks queued on all CPUs. So one approach > > > > would be to invoke this function, and if the return value was truly > > > > huge (taking size of memory and who knows that all else into account), > > > > do the rcu_barrier() to wait for RCU to clear its current backlog. > > > > > > The shrinker scan control structure has a node mask in it to > > > indicate what node (and hence CPUs) it should be reclaiming from. > > > This information comes from the main reclaim scan routine, so it > > > would be trivial to feed straight into the RCU code to have it > > > act on just the CPUs/node that we are reclaiming memory from... > > > > For the callbacks, RCU can operate on CPUs, in theory anyway. The > > grace period itself, however, is inherently global. > > > > > > On the NUMA point, it would be dead easy for me to supply a function > > > > that returned the number of callbacks on a given CPU, which would allow > > > > you to similarly evaluate a NUMA node, a cgroup, or whatever. > > > > > > I'd think it runs the other way around - we optimisitically call the > > > RCU layer to do cleanup, and the RCU layer decides if there's enough > > > queued callbacks on the cpus/node to run callbacks immediately. It > > > would even be provided with the scan priority to indicate the level > > > of desperation memory reclaim is under.... > > > > Easy for RCU to count the number of callbacks. That said, it has no > > idea which callbacks are which. Perhaps kfree_rcu() could gather that > > information from the slab allocator, though. > > It's simple to mark slab shrinkers that frees object through RCU and > count freed objects in reclaimer: > > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -536,6 +536,9 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl, > else > new_nr = atomic_long_read(&shrinker->nr_deferred[nid]); > > + if (shrinker->flags & SHRINKER_KFREE_RCU) > + shrinkctl->nr_kfree_rcu += freed; > + > trace_mm_shrink_slab_end(shrinker, nid, freed, nr, new_nr, total_scan); > return freed; > } > > And when accumulated enough do some synchronization. > > Probably it's better to sum freed objects at per-cpu variable, > and accumulate size rather than count. RCU currently has no notion of size outside of possibly kfree_rcu(), so that would be new information to RCU. Thanx, Paul > > > > > All GFP_NOFAIL users should allow direct-reclaim, thus this loop > > > > > in page_alloc shouldn't block RCU and doesn't need special care. > > > > > > > > I must defer to you guys on this. The main caution is the duration of > > > > direct reclaim. After all, if it is too long, the kfree_rcu() instance > > > > would have been better of just invoking synchronize_rcu(). > > > > > > Individual callers of kfree_rcu() have no idea of the load on RCU, > > > nor how long direct reclaim is taking. Calling synchronize_rcu() > > > incorrectly has pretty major downsides to it, so nobody should be > > > trying to expedite kfree_rcu() unless there is a good reason to do > > > so (e.g. at unmount to ensure everything allocated by a filesystem > > > has actually been freed). Hence I'd much prefer the decision to > > > expedite callbacks is made by the RCU subsystem based on it's known > > > callback load and some indication of how close memory reclaim is to > > > declaring OOM... > > > > Sorry, I was unclear. There is a new single-argument kfree_rcu() under > > way that does not require an rcu_head in the structure being freed. > > However, in this case, kfree_rcu() might either allocate the memory > > that is needed to track the memory to be freed on the one hand or just > > invoke synchronize_rcu() on the other. So this decision would be taken > > inside kfree_rcu(), and not be visible to either core RCU or the caller > > of kfree_rcu(). > > > > This decision is made based on whether or not the allocator provides > > kfree_rcu() the memory needed. The tradeoff is what GFP flags are > > supplied. So the question kfree_rcu() has to answer is "Would it be > > better to give myself to reclaim as an additional task, or would it > > instead be better to just invoke synchronize_rcu() and then immediately > > free()?" > > > > I am probably still unclear, but hopefully at least one step in the > > right direction. > > > > Thanx, Paul > > ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH RFC tip/core/rcu] Add shrinker to shift to fast/inefficient GP mode 2020-05-13 3:18 ` Paul E. McKenney 2020-05-13 4:35 ` Konstantin Khlebnikov @ 2020-05-13 5:07 ` Dave Chinner 2020-05-13 13:03 ` Paul E. McKenney 1 sibling, 1 reply; 20+ messages in thread From: Dave Chinner @ 2020-05-13 5:07 UTC (permalink / raw) To: Paul E. McKenney Cc: Konstantin Khlebnikov, Johannes Weiner, Andrew Morton, rcu, linux-kernel, kernel-team, mingo, jiangshanlai, dipankar, mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells, edumazet, fweisbec, oleg, joel, viro On Tue, May 12, 2020 at 08:18:26PM -0700, Paul E. McKenney wrote: > On Wed, May 13, 2020 at 11:32:38AM +1000, Dave Chinner wrote: > > On Sat, May 09, 2020 at 09:09:00AM -0700, Paul E. McKenney wrote: > > > On Sat, May 09, 2020 at 11:54:40AM +0300, Konstantin Khlebnikov wrote: > > > > On 08/05/2020 17.46, Paul E. McKenney wrote: > > > > > Easy for me to provide "start fast and inefficient mode" and "stop fast > > > > > and inefficient mode" APIs for MM to call! > > > > > > > > > > How about rcu_mempressure_start() and rcu_mempressure_end()? I would > > > > > expect them not to nest (as in if you need them to nest, please let > > > > > me know). I would not expect these to be invoked all that often (as in > > > > > if you do need them to be fast and scalable, please let me know). > > > > > > RCU would then be in fast/inefficient mode if either MM told it to be > > > > > or if RCU had detected callback overload on at least one CPU. > > > > > > > > > > Seem reasonable? > > > > > > > > Not exactly nested calls, but kswapd threads are per numa node. > > > > So, at some level nodes under pressure must be counted. > > > > > > Easy enough, especially given that RCU already "counts" CPUs having > > > excessive numbers of callbacks. But assuming that the transitions to/from > > > OOM are rare, I would start by just counting them with a global counter. > > > If the counter is non-zero, RCU is in fast and inefficient mode. > > > > > > > Also forcing rcu calls only for cpus in one numa node might be useful. > > > > > > Interesting. RCU currently evaluates a given CPU by comparing the > > > number of callbacks against a fixed cutoff that can be set at boot using > > > rcutree.qhimark, which defaults to 10,000. When this cutoff is exceeded, > > > RCU becomes more aggressive about invoking callbacks on that CPU, for > > > example, by sacrificing some degree of real-time response. I believe > > > that this heuristic would also serve the OOM use case well. > > > > So one of the things that I'm not sure people have connected here is > > that memory reclaim done by shrinkers is one of the things that > > drives huge numbers of call_rcu() callbacks to free memory via rcu. > > If we are reclaiming dentries and inodes, then we can be pushing > > thousands to hundreds of thousands of objects into kfree_rcu() > > and/or direct call_rcu() calls to free these objects in a single > > reclaim pass. > > Good point! > > > Hence the trigger for RCU going into "excessive callback" mode > > might, in fact, be kswapd running a pass over the shrinkers. i.e. > > memory reclaim itself can be responsible for pushing RCU into this "OOM > > pressure" situation. > > > > So perhaps we've missed a trick here by not having the memory > > reclaim routines trigger RCU callbacks at the end of a priority > > scan. The shrinkers have queued the objects for freeing, but they > > haven't actually been freed yet and so things like slab pages > > haven't actually been returned to the free pool even though the > > shrinkers have said "freed this many objects"... > > > > i.e. perhaps the right solution here is a "rcu_run_callbacks()" > > function that memory reclaim calls before backing off and/or winding > > up reclaim priority. > > It would not be hard to make something that put RCU into fast/inefficient > mode for a couple of grace periods. I will also look into the possibility > of speeding up callback invocation. > > It might also make sense to put RCU grace periods into fast mode while > running the shrinkers that are freeing dentries and inodes. However, > kbuild test robot reports ugly regressions when putting RCU into > fast/inefficient mode to quickly and too often. As in 78.5% degradation > on one of the benchmarks. I don't think it should be dependent on what specific shrinkers free. There are other objects that may be RCU freed by shrinkers, so it really shouldn't be applied just to specific shrinker instances. > > > > I wonder if direct-reclaim should at some stage simply wait for RCU QS. > > > > I.e. call rcu_barrier() or similar somewhere before invoking OOM. > > > > > > The rcu_oom_count() function in the patch starting this thread returns the > > > total number of outstanding callbacks queued on all CPUs. So one approach > > > would be to invoke this function, and if the return value was truly > > > huge (taking size of memory and who knows that all else into account), > > > do the rcu_barrier() to wait for RCU to clear its current backlog. > > > > The shrinker scan control structure has a node mask in it to > > indicate what node (and hence CPUs) it should be reclaiming from. > > This information comes from the main reclaim scan routine, so it > > would be trivial to feed straight into the RCU code to have it > > act on just the CPUs/node that we are reclaiming memory from... > > For the callbacks, RCU can operate on CPUs, in theory anyway. The > grace period itself, however, is inherently global. *nod* The memory reclaim backoffs tend to be in the order of 50-100 milliseconds, though, so we are talking multiple grace periods here, right? In which case, triggering a grace period expiry before a backoff takes place might make a lot sense... > > > On the NUMA point, it would be dead easy for me to supply a function > > > that returned the number of callbacks on a given CPU, which would allow > > > you to similarly evaluate a NUMA node, a cgroup, or whatever. > > > > I'd think it runs the other way around - we optimisitically call the > > RCU layer to do cleanup, and the RCU layer decides if there's enough > > queued callbacks on the cpus/node to run callbacks immediately. It > > would even be provided with the scan priority to indicate the level > > of desperation memory reclaim is under.... > > Easy for RCU to count the number of callbacks. That said, it has no > idea which callbacks are which. Perhaps kfree_rcu() could gather that > information from the slab allocator, though. > > > > > All GFP_NOFAIL users should allow direct-reclaim, thus this loop > > > > in page_alloc shouldn't block RCU and doesn't need special care. > > > > > > I must defer to you guys on this. The main caution is the duration of > > > direct reclaim. After all, if it is too long, the kfree_rcu() instance > > > would have been better of just invoking synchronize_rcu(). > > > > Individual callers of kfree_rcu() have no idea of the load on RCU, > > nor how long direct reclaim is taking. Calling synchronize_rcu() > > incorrectly has pretty major downsides to it, so nobody should be > > trying to expedite kfree_rcu() unless there is a good reason to do > > so (e.g. at unmount to ensure everything allocated by a filesystem > > has actually been freed). Hence I'd much prefer the decision to > > expedite callbacks is made by the RCU subsystem based on it's known > > callback load and some indication of how close memory reclaim is to > > declaring OOM... > > Sorry, I was unclear. There is a new single-argument kfree_rcu() under > way that does not require an rcu_head in the structure being freed. > However, in this case, kfree_rcu() might either allocate the memory > that is needed to track the memory to be freed on the one hand or just > invoke synchronize_rcu() on the other. So this decision would be taken > inside kfree_rcu(), and not be visible to either core RCU or the caller > of kfree_rcu(). Ah. The need to allocate memory to free memory, and with that comes the requirement of a forwards progress guarantee. It's mempools all over again :P Personally, though, designing functionality that specifically requires memory allocation to free memory seems like an incredibly fragile thing to be doing. I don't know the use case here, though, but jsut the general description of what you are trying to do rings alarm bells in my head... > This decision is made based on whether or not the allocator provides > kfree_rcu() the memory needed. The tradeoff is what GFP flags are > supplied. So there's a reclaim recursion problem here, too? Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH RFC tip/core/rcu] Add shrinker to shift to fast/inefficient GP mode 2020-05-13 5:07 ` Dave Chinner @ 2020-05-13 13:03 ` Paul E. McKenney 0 siblings, 0 replies; 20+ messages in thread From: Paul E. McKenney @ 2020-05-13 13:03 UTC (permalink / raw) To: Dave Chinner Cc: Konstantin Khlebnikov, Johannes Weiner, Andrew Morton, rcu, linux-kernel, kernel-team, mingo, jiangshanlai, dipankar, mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells, edumazet, fweisbec, oleg, joel, viro On Wed, May 13, 2020 at 03:07:26PM +1000, Dave Chinner wrote: > On Tue, May 12, 2020 at 08:18:26PM -0700, Paul E. McKenney wrote: > > On Wed, May 13, 2020 at 11:32:38AM +1000, Dave Chinner wrote: > > > On Sat, May 09, 2020 at 09:09:00AM -0700, Paul E. McKenney wrote: > > > > On Sat, May 09, 2020 at 11:54:40AM +0300, Konstantin Khlebnikov wrote: > > > > > On 08/05/2020 17.46, Paul E. McKenney wrote: > > > > > > Easy for me to provide "start fast and inefficient mode" and "stop fast > > > > > > and inefficient mode" APIs for MM to call! > > > > > > > > > > > > How about rcu_mempressure_start() and rcu_mempressure_end()? I would > > > > > > expect them not to nest (as in if you need them to nest, please let > > > > > > me know). I would not expect these to be invoked all that often (as in > > > > > > if you do need them to be fast and scalable, please let me know). > > > > > > > RCU would then be in fast/inefficient mode if either MM told it to be > > > > > > or if RCU had detected callback overload on at least one CPU. > > > > > > > > > > > > Seem reasonable? > > > > > > > > > > Not exactly nested calls, but kswapd threads are per numa node. > > > > > So, at some level nodes under pressure must be counted. > > > > > > > > Easy enough, especially given that RCU already "counts" CPUs having > > > > excessive numbers of callbacks. But assuming that the transitions to/from > > > > OOM are rare, I would start by just counting them with a global counter. > > > > If the counter is non-zero, RCU is in fast and inefficient mode. > > > > > > > > > Also forcing rcu calls only for cpus in one numa node might be useful. > > > > > > > > Interesting. RCU currently evaluates a given CPU by comparing the > > > > number of callbacks against a fixed cutoff that can be set at boot using > > > > rcutree.qhimark, which defaults to 10,000. When this cutoff is exceeded, > > > > RCU becomes more aggressive about invoking callbacks on that CPU, for > > > > example, by sacrificing some degree of real-time response. I believe > > > > that this heuristic would also serve the OOM use case well. > > > > > > So one of the things that I'm not sure people have connected here is > > > that memory reclaim done by shrinkers is one of the things that > > > drives huge numbers of call_rcu() callbacks to free memory via rcu. > > > If we are reclaiming dentries and inodes, then we can be pushing > > > thousands to hundreds of thousands of objects into kfree_rcu() > > > and/or direct call_rcu() calls to free these objects in a single > > > reclaim pass. > > > > Good point! > > > > > Hence the trigger for RCU going into "excessive callback" mode > > > might, in fact, be kswapd running a pass over the shrinkers. i.e. > > > memory reclaim itself can be responsible for pushing RCU into this "OOM > > > pressure" situation. > > > > > > So perhaps we've missed a trick here by not having the memory > > > reclaim routines trigger RCU callbacks at the end of a priority > > > scan. The shrinkers have queued the objects for freeing, but they > > > haven't actually been freed yet and so things like slab pages > > > haven't actually been returned to the free pool even though the > > > shrinkers have said "freed this many objects"... > > > > > > i.e. perhaps the right solution here is a "rcu_run_callbacks()" > > > function that memory reclaim calls before backing off and/or winding > > > up reclaim priority. > > > > It would not be hard to make something that put RCU into fast/inefficient > > mode for a couple of grace periods. I will also look into the possibility > > of speeding up callback invocation. > > > > It might also make sense to put RCU grace periods into fast mode while > > running the shrinkers that are freeing dentries and inodes. However, > > kbuild test robot reports ugly regressions when putting RCU into > > fast/inefficient mode to quickly and too often. As in 78.5% degradation > > on one of the benchmarks. > > I don't think it should be dependent on what specific shrinkers > free. There are other objects that may be RCU freed by shrinkers, > so it really shouldn't be applied just to specific shrinker > instances. Plus a call_rcu() might be freeing a linked structure, so counting the size of the argument to call_rcu() would be understating the total amount of memory being freed. > > > > > I wonder if direct-reclaim should at some stage simply wait for RCU QS. > > > > > I.e. call rcu_barrier() or similar somewhere before invoking OOM. > > > > > > > > The rcu_oom_count() function in the patch starting this thread returns the > > > > total number of outstanding callbacks queued on all CPUs. So one approach > > > > would be to invoke this function, and if the return value was truly > > > > huge (taking size of memory and who knows that all else into account), > > > > do the rcu_barrier() to wait for RCU to clear its current backlog. > > > > > > The shrinker scan control structure has a node mask in it to > > > indicate what node (and hence CPUs) it should be reclaiming from. > > > This information comes from the main reclaim scan routine, so it > > > would be trivial to feed straight into the RCU code to have it > > > act on just the CPUs/node that we are reclaiming memory from... > > > > For the callbacks, RCU can operate on CPUs, in theory anyway. The > > grace period itself, however, is inherently global. > > *nod* > > The memory reclaim backoffs tend to be in the order of 50-100 > milliseconds, though, so we are talking multiple grace periods here, > right? In which case, triggering a grace period expiry before a > backoff takes place might make a lot sense... Usually, yes, I would expect several grace periods to elapse during a backoff. > > > > On the NUMA point, it would be dead easy for me to supply a function > > > > that returned the number of callbacks on a given CPU, which would allow > > > > you to similarly evaluate a NUMA node, a cgroup, or whatever. > > > > > > I'd think it runs the other way around - we optimisitically call the > > > RCU layer to do cleanup, and the RCU layer decides if there's enough > > > queued callbacks on the cpus/node to run callbacks immediately. It > > > would even be provided with the scan priority to indicate the level > > > of desperation memory reclaim is under.... > > > > Easy for RCU to count the number of callbacks. That said, it has no > > idea which callbacks are which. Perhaps kfree_rcu() could gather that > > information from the slab allocator, though. > > > > > > > All GFP_NOFAIL users should allow direct-reclaim, thus this loop > > > > > in page_alloc shouldn't block RCU and doesn't need special care. > > > > > > > > I must defer to you guys on this. The main caution is the duration of > > > > direct reclaim. After all, if it is too long, the kfree_rcu() instance > > > > would have been better of just invoking synchronize_rcu(). > > > > > > Individual callers of kfree_rcu() have no idea of the load on RCU, > > > nor how long direct reclaim is taking. Calling synchronize_rcu() > > > incorrectly has pretty major downsides to it, so nobody should be > > > trying to expedite kfree_rcu() unless there is a good reason to do > > > so (e.g. at unmount to ensure everything allocated by a filesystem > > > has actually been freed). Hence I'd much prefer the decision to > > > expedite callbacks is made by the RCU subsystem based on it's known > > > callback load and some indication of how close memory reclaim is to > > > declaring OOM... > > > > Sorry, I was unclear. There is a new single-argument kfree_rcu() under > > way that does not require an rcu_head in the structure being freed. > > However, in this case, kfree_rcu() might either allocate the memory > > that is needed to track the memory to be freed on the one hand or just > > invoke synchronize_rcu() on the other. So this decision would be taken > > inside kfree_rcu(), and not be visible to either core RCU or the caller > > of kfree_rcu(). > > Ah. The need to allocate memory to free memory, and with that comes > the requirement of a forwards progress guarantee. It's mempools all > over again :P > > Personally, though, designing functionality that specifically > requires memory allocation to free memory seems like an incredibly > fragile thing to be doing. I don't know the use case here, though, > but jsut the general description of what you are trying to do rings > alarm bells in my head... And mine as well. Hence my earlier insistence that kfree_rcu() never block waiting for memory, but instead just invoke synchronize_rcu() and then immediately free the memory. Others have since convinced me that there are combinations of GFP flags that allow only limited sleeping so as to avoid the OOM deadlocks that I fear. > > This decision is made based on whether or not the allocator provides > > kfree_rcu() the memory needed. The tradeoff is what GFP flags are > > supplied. > > So there's a reclaim recursion problem here, too? There was an earlier discussion as to what was safe, with one recommendation being __GFP_NORETRY. Thanx, Paul ^ permalink raw reply [flat|nested] 20+ messages in thread
[parent not found: <20200507093647.11932-1-hdanton@sina.com>]
* Re: [PATCH RFC tip/core/rcu] Add shrinker to shift to fast/inefficient GP mode [not found] ` <20200507093647.11932-1-hdanton@sina.com> @ 2020-05-07 15:49 ` Paul E. McKenney [not found] ` <20200508133743.9356-1-hdanton@sina.com> 1 sibling, 0 replies; 20+ messages in thread From: Paul E. McKenney @ 2020-05-07 15:49 UTC (permalink / raw) To: Hillf Danton; +Cc: linux-kernel, rcu, akpm, linux-mm On Thu, May 07, 2020 at 05:36:47PM +0800, Hillf Danton wrote: > > Hello Paul > > On Wed, 6 May 2020 17:42:40 Paul E. McKenney wrote: > > > > This commit adds a shrinker so as to inform RCU when memory is scarce. > > A simpler hook is added in the logic of kswapd for subscribing the info > that memory pressure is high, and then on top of it make rcu a subscriber > by copying your code for the shrinker, wishing it makes a sense to you. > > What's not yet included is to make the hook per node to help make every > reviewer convinced that memory is becoming tight. Of course without the > cost of making subscribers node aware. > > Hillf I must defer to the MM folks on the MM portion of this patch, but early warning of impending memory pressure would be extremely good. A few RCU-related notes inline below, though. Thanx, Paul > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -49,6 +49,16 @@ static inline void set_max_mapnr(unsigne > static inline void set_max_mapnr(unsigned long limit) { } > #endif > > +/* subscriber of kswapd's memory_pressure_high signal */ > +struct mph_subscriber { > + struct list_head node; > + void (*info) (void *data); > + void *data; > +}; > + > +int mph_subscribe(struct mph_subscriber *ms); > +void mph_unsubscribe(struct mph_subscriber *ms); > + > extern atomic_long_t _totalram_pages; > static inline unsigned long totalram_pages(void) > { > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -3536,6 +3536,40 @@ static bool kswapd_shrink_node(pg_data_t > } > > /* > + * subscribers of kswapd's signal that memory pressure is high > + */ > +static LIST_HEAD(mph_subs); > +static DEFINE_MUTEX(mph_lock); > + > +int mph_subscribe(struct mph_subscriber *ms) > +{ > + if (!ms->info) > + return -EAGAIN; > + > + mutex_lock(&mph_lock); > + list_add_tail(&ms->node, &mph_subs); > + mutex_unlock(&mph_lock); > + return 0; > +} > + > +void mph_unsubscribe(struct mph_subscriber *ms) > +{ > + mutex_lock(&mph_lock); > + list_del(&ms->node); > + mutex_unlock(&mph_lock); > +} > + > +static void kswapd_bbc_mph(void) > +{ > + struct mph_subscriber *ms; > + > + mutex_lock(&mph_lock); > + list_for_each_entry(ms, &mph_subs, node) > + ms->info(ms->data); > + mutex_unlock(&mph_lock); > +} > + > +/* > * For kswapd, balance_pgdat() will reclaim pages across a node from zones > * that are eligible for use by the caller until at least one zone is > * balanced. > @@ -3663,8 +3697,11 @@ restart: > * If we're getting trouble reclaiming, start doing writepage > * even in laptop mode. > */ > - if (sc.priority < DEF_PRIORITY - 2) > + if (sc.priority < DEF_PRIORITY - 2) { > sc.may_writepage = 1; > + if (sc.priority == DEF_PRIORITY - 3) > + kswapd_bbc_mph(); > + } > > /* Call soft limit reclaim before calling shrink_node. */ > sc.nr_scanned = 0; > --- a/kernel/rcu/tree.h > +++ b/kernel/rcu/tree.h > @@ -325,6 +325,8 @@ struct rcu_state { > int ncpus_snap; /* # CPUs seen last time. */ > u8 cbovld; /* Callback overload now? */ > u8 cbovldnext; /* ^ ^ next time? */ > + u8 mph; /* mm pressure high signal from kswapd */ > + unsigned long mph_end; /* time stamp in jiffies */ > > unsigned long jiffies_force_qs; /* Time at which to invoke */ > /* force_quiescent_state(). */ > --- a/kernel/rcu/tree.c > +++ b/kernel/rcu/tree.c > @@ -52,6 +52,7 @@ > #include <linux/kprobes.h> > #include <linux/gfp.h> > #include <linux/oom.h> > +#include <linux/mm.h> > #include <linux/smpboot.h> > #include <linux/jiffies.h> > #include <linux/slab.h> > @@ -2314,8 +2315,15 @@ static void force_qs_rnp(int (*f)(struct > struct rcu_data *rdp; > struct rcu_node *rnp; > > - rcu_state.cbovld = rcu_state.cbovldnext; > + rcu_state.cbovld = smp_load_acquire(&rcu_state.mph) || > + rcu_state.cbovldnext; > rcu_state.cbovldnext = false; > + > + if (READ_ONCE(rcu_state.mph) && > + time_after(jiffies, READ_ONCE(rcu_state.mph_end))) { > + WRITE_ONCE(rcu_state.mph, false); > + pr_info("%s: Ending OOM-mode grace periods.\n", __func__); > + } > rcu_for_each_leaf_node(rnp) { > cond_resched_tasks_rcu_qs(); > mask = 0; > @@ -2643,6 +2651,20 @@ static void check_cb_ovld(struct rcu_dat > raw_spin_unlock_rcu_node(rnp); > } > > +static void rcu_mph_info(void *data) This pointer will always be &rcu_state, so why not ignore the pointer and use "rcu_state" below? RCU grace periods are inherently global, so I don't know of any way for RCU to focus on a given NUMA node. All or nothing. But on the other hand, speeding up RCU grace periods will also help specific NUMA nodes, so I believe that it is all good. > +{ > + struct rcu_state *state = data; > + > + WRITE_ONCE(state->mph_end, jiffies + HZ / 10); > + smp_store_release(&state->mph, true); > + rcu_force_quiescent_state(); > +} > + > +static struct mph_subscriber rcu_mph_subscriber = { > + .info = rcu_mph_info, > + .data = &rcu_state, Then this ".data" entry can be omitted, correct? > +}; > + > /* Helper function for call_rcu() and friends. */ > static void > __call_rcu(struct rcu_head *head, rcu_callback_t func) > @@ -4036,6 +4058,8 @@ void __init rcu_init(void) > qovld_calc = DEFAULT_RCU_QOVLD_MULT * qhimark; > else > qovld_calc = qovld; > + > + mph_subscribe(&rcu_mph_subscriber); > } > > #include "tree_stall.h" > ^ permalink raw reply [flat|nested] 20+ messages in thread
[parent not found: <20200508133743.9356-1-hdanton@sina.com>]
* Re: [PATCH RFC tip/core/rcu] Add shrinker to shift to fast/inefficient GP mode [not found] ` <20200508133743.9356-1-hdanton@sina.com> @ 2020-05-08 14:47 ` Paul E. McKenney 0 siblings, 0 replies; 20+ messages in thread From: Paul E. McKenney @ 2020-05-08 14:47 UTC (permalink / raw) To: Hillf Danton; +Cc: linux-kernel, rcu, akpm, linux-mm On Fri, May 08, 2020 at 09:37:43PM +0800, Hillf Danton wrote: > > On Thu, 7 May 2020 08:49:10 Paul E. McKenney wrote: > > > > > +static void rcu_mph_info(void *data) > > > > This pointer will always be &rcu_state, so why not ignore the pointer > > and use "rcu_state" below? > > > Yes you're right. > > > RCU grace periods are inherently global, so I don't know of any way > > for RCU to focus on a given NUMA node. All or nothing. > > Or is it feasible to expose certain RCU thing to VM, say, with which kswapd > can kick grace period every time the kthreads think it's needed? That way > the work to gauge memory pressure can be off RCU's shoulders. A pair of functions RCU provides is easy for me. ;-) Thanx, Paul > > But on the > > other hand, speeding up RCU grace periods will also help specific > > NUMA nodes, so I believe that it is all good. > > > > > +{ > > > + struct rcu_state *state = data; > > > + > > > + WRITE_ONCE(state->mph_end, jiffies + HZ / 10); > > > + smp_store_release(&state->mph, true); > > > + rcu_force_quiescent_state(); > > > +} > > > + > > > +static struct mph_subscriber rcu_mph_subscriber = { > > > + .info = rcu_mph_info, > > > + .data = &rcu_state, > > > > Then this ".data" entry can be omitted, correct? > > Yes :) > > Hillf > ^ permalink raw reply [flat|nested] 20+ messages in thread
end of thread, other threads:[~2020-05-13 13:03 UTC | newest] Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2020-05-07 0:42 [PATCH RFC tip/core/rcu] Add shrinker to shift to fast/inefficient GP mode Paul E. McKenney 2020-05-07 0:55 ` Andrew Morton 2020-05-07 2:45 ` Paul E. McKenney 2020-05-07 17:00 ` Johannes Weiner 2020-05-07 17:09 ` Paul E. McKenney 2020-05-07 17:29 ` Paul E. McKenney 2020-05-07 18:31 ` Johannes Weiner 2020-05-07 19:09 ` Paul E. McKenney 2020-05-08 9:00 ` Konstantin Khlebnikov 2020-05-08 14:46 ` Paul E. McKenney 2020-05-09 8:54 ` Konstantin Khlebnikov 2020-05-09 16:09 ` Paul E. McKenney 2020-05-13 1:32 ` Dave Chinner 2020-05-13 3:18 ` Paul E. McKenney 2020-05-13 4:35 ` Konstantin Khlebnikov 2020-05-13 12:52 ` Paul E. McKenney 2020-05-13 5:07 ` Dave Chinner 2020-05-13 13:03 ` Paul E. McKenney [not found] ` <20200507093647.11932-1-hdanton@sina.com> 2020-05-07 15:49 ` Paul E. McKenney [not found] ` <20200508133743.9356-1-hdanton@sina.com> 2020-05-08 14:47 ` Paul E. McKenney
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).