linux-mm.kvack.org archive mirror
* [PATCH] list_lru: Prefetch neighboring list entries before acquiring lock
@ 2017-11-29 14:17 Waiman Long
  2017-11-29 21:53 ` Andrew Morton
                   ` (3 more replies)
  0 siblings, 4 replies; 18+ messages in thread
From: Waiman Long @ 2017-11-29 14:17 UTC (permalink / raw)
  To: Andrew Morton, Vladimir Davydov, Johannes Weiner, Dave Chinner
  Cc: linux-kernel, linux-mm, Waiman Long

The list_lru_del() function removes the given item from the LRU list.
The operation looks simple, but it involves writing into the cachelines
of the two neighboring list entries in order to get the deletion done.
That can take a while if the cachelines aren't there yet, thus
prolonging the lock hold time.

To reduce the lock hold time, the cachelines of the two neighboring
list entries are now prefetched before acquiring the list_lru_node's
lock.

Using a multi-threaded test program that created a large number
of dentries and then killed them, the execution time was reduced
from 38.5s to 36.6s after applying the patch on a 2-socket 36-core
72-thread x86-64 system.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 mm/list_lru.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/mm/list_lru.c b/mm/list_lru.c
index f141f0c..65aae44 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -132,8 +132,16 @@ bool list_lru_del(struct list_lru *lru, struct list_head *item)
 	struct list_lru_node *nlru = &lru->node[nid];
 	struct list_lru_one *l;
 
+	/*
+	 * Prefetch the neighboring list entries to reduce lock hold time.
+	 */
+	if (unlikely(list_empty(item)))
+		return false;
+	prefetchw(item->prev);
+	prefetchw(item->next);
+
 	spin_lock(&nlru->lock);
-	if (!list_empty(item)) {
+	if (likely(!list_empty(item))) {
 		l = list_lru_from_kmem(nlru, item);
 		list_del_init(item);
 		l->nr_items--;
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH] list_lru: Prefetch neighboring list entries before acquiring lock
  2017-11-29 14:17 [PATCH] list_lru: Prefetch neighboring list entries before acquiring lock Waiman Long
@ 2017-11-29 21:53 ` Andrew Morton
  2017-11-30  0:42   ` Dave Chinner
  2017-11-30  0:53 ` Minchan Kim
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 18+ messages in thread
From: Andrew Morton @ 2017-11-29 21:53 UTC (permalink / raw)
  To: Waiman Long
  Cc: Vladimir Davydov, Johannes Weiner, Dave Chinner, linux-kernel, linux-mm

On Wed, 29 Nov 2017 09:17:34 -0500 Waiman Long <longman@redhat.com> wrote:

> The list_lru_del() function removes the given item from the LRU list.
> The operation looks simple, but it involves writing into the cachelines
> of the two neighboring list entries in order to get the deletion done.
> That can take a while if the cachelines aren't there yet, thus
> prolonging the lock hold time.
> 
> To reduce the lock hold time, the cachelines of the two neighboring
> list entries are now prefetched before acquiring the list_lru_node's
> lock.
> 
> Using a multi-threaded test program that created a large number
> of dentries and then killed them, the execution time was reduced
> from 38.5s to 36.6s after applying the patch on a 2-socket 36-core
> 72-thread x86-64 system.

Patch looks good.

Can someone (Dave?) please explain why list_lru_del() supports deletion
of an already list_empty(item)?  This seems a rather dangerous thing to
encourage.  Use cases I can think of are:

a) item is already reliably deleted, so why the heck was the caller
   calling list_lru_del() and 

b) item might be concurrently deleted by another thread, in which case
   the race loser is likely to hit a use-after-free.

Is there a good use case here?

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] list_lru: Prefetch neighboring list entries before acquiring lock
  2017-11-29 21:53 ` Andrew Morton
@ 2017-11-30  0:42   ` Dave Chinner
  2017-11-30 13:54     ` Waiman Long
  0 siblings, 1 reply; 18+ messages in thread
From: Dave Chinner @ 2017-11-30  0:42 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Waiman Long, Vladimir Davydov, Johannes Weiner, linux-kernel, linux-mm

On Wed, Nov 29, 2017 at 01:53:19PM -0800, Andrew Morton wrote:
> On Wed, 29 Nov 2017 09:17:34 -0500 Waiman Long <longman@redhat.com> wrote:
> 
> > The list_lru_del() function removes the given item from the LRU list.
> > The operation looks simple, but it involves writing into the cachelines
> > of the two neighboring list entries in order to get the deletion done.
> > That can take a while if the cachelines aren't there yet, thus
> > prolonging the lock hold time.
> > 
> > To reduce the lock hold time, the cachelines of the two neighboring
> > list entries are now prefetched before acquiring the list_lru_node's
> > lock.
> > 
> > Using a multi-threaded test program that created a large number
> > of dentries and then killed them, the execution time was reduced
> > from 38.5s to 36.6s after applying the patch on a 2-socket 36-core
> > 72-thread x86-64 system.
> 
> Patch looks good.
> 
> Can someone (Dave?) please explain why list_lru_del() supports deletion
> of an already list_empty(item)?
> This seems a rather dangerous thing to
> encourage.  Use cases I can think of are:
> 
> a) item is already reliably deleted, so why the heck was the caller
>    calling list_lru_del() and 

Higher level operations can race. e.g. caller looks up an object,
finds it on the LRU, takes a reference. Then calls list_lru_del()
to remove it from the LRU. It blocks 'cause it can't get the list
lock as....

... Meanwhile, the list shrinker is running, sees the object on the
LRU list, sees it has a valid reference count, does lazy LRU cleanup
by running list_lru_isolate() on the object, which removes it from
the LRU list. Eventually it drops the list lock, and ....

... the original thread gets the lock in list_lru_del() and sees the
object has already been removed from the LRU....

IOWs, this sort of boilerplate code is potentially dangerous if
list_lru_del() can't handle items that have already been removed
from the list:

	if (!list_empty(&obj->lru))
		list_lru_del(&obj->lru);

Because this:

	if (!list_empty(&obj->lru))
		<preempt>
		<shrinker removes obj from LRU>
		list_lru_del(&obj->lru);
			<SPLAT>

Would result in bad things happening....

And, from that perspective, the racy shortcut in the proposed patch
is wrong, too. Prefetch is fine, but in general shortcutting list
empty checks outside the internal lock isn't.
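
That is, the only emptiness check that means anything is the one made
while holding nlru->lock, which is the shape the existing code already
has. A minimal sketch of that safe shape (the helper name here is made
up, it is not the actual mm/list_lru.c function):

	/*
	 * Minimal sketch of the safe pattern: the emptiness check is only
	 * done while holding the list lock, so a concurrent
	 * list_lru_isolate() cannot slip in between the check and the
	 * deletion.
	 */
	static bool lru_del_locked(struct list_lru_node *nlru,
				   struct list_head *item)
	{
		bool removed = false;

		spin_lock(&nlru->lock);
		if (!list_empty(item)) {
			list_del_init(item);
			removed = true;
		}
		spin_unlock(&nlru->lock);
		return removed;
	}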

> b) item might be concurrently deleted by another thread, in which case
>    the race loser is likely to hit a use-after-free.

Nope, the list_lru infrastructure is just for tracking how the
object is aging. It is not designed to control object lifecycle
behaviour - it's not a kref. It's just an ordered list with a
shrinker callback to allow subsystems to react to memory pressure.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] list_lru: Prefetch neighboring list entries before acquiring lock
  2017-11-29 14:17 [PATCH] list_lru: Prefetch neighboring list entries before acquiring lock Waiman Long
  2017-11-29 21:53 ` Andrew Morton
@ 2017-11-30  0:53 ` Minchan Kim
  2017-11-30 13:43   ` Waiman Long
  2017-11-30 14:34 ` Matthew Wilcox
  2017-12-05 14:49 ` Michal Hocko
  3 siblings, 1 reply; 18+ messages in thread
From: Minchan Kim @ 2017-11-30  0:53 UTC (permalink / raw)
  To: Waiman Long
  Cc: Andrew Morton, Vladimir Davydov, Johannes Weiner, Dave Chinner,
	linux-kernel, linux-mm

Hello,

On Wed, Nov 29, 2017 at 09:17:34AM -0500, Waiman Long wrote:
> The list_lru_del() function removes the given item from the LRU list.
> The operation looks simple, but it involves writing into the cachelines
> of the two neighboring list entries in order to get the deletion done.
> That can take a while if the cachelines aren't there yet, thus
> prolonging the lock hold time.
> 
> To reduce the lock hold time, the cachelines of the two neighboring
> list entries are now prefetched before acquiring the list_lru_node's
> lock.
> 
> Using a multi-threaded test program that created a large number
> of dentries and then killed them, the execution time was reduced
> from 38.5s to 36.6s after applying the patch on a 2-socket 36-core
> 72-thread x86-64 system.
> 
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
>  mm/list_lru.c | 10 +++++++++-
>  1 file changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/list_lru.c b/mm/list_lru.c
> index f141f0c..65aae44 100644
> --- a/mm/list_lru.c
> +++ b/mm/list_lru.c
> @@ -132,8 +132,16 @@ bool list_lru_del(struct list_lru *lru, struct list_head *item)
>  	struct list_lru_node *nlru = &lru->node[nid];
>  	struct list_lru_one *l;
>  
> +	/*
> +	 * Prefetch the neighboring list entries to reduce lock hold time.
> +	 */
> +	if (unlikely(list_empty(item)))
> +		return false;
> +	prefetchw(item->prev);
> +	prefetchw(item->next);
> +

A question:

A few month ago, I had a chance to measure prefetch effect with my testing
workload. For the clarification, it's not list_lru_del but list traverse
stuff so it might be similar.

With my experiment at that time, it was really hard to find the best place
to add prefetchw. Sometimes it was too early or too late, so the effect was
not good, or even worse in some cases.

Also, the performance was different on each machine, although I only had
two test machines. ;-)

So my question is: what's the rule of thumb for adding a prefetch instruction?
Like your code, putting the prefetch right before the code that touches the data?

I really wonder what rule makes every arch/machine happy
with prefetch.

>  	spin_lock(&nlru->lock);
> -	if (!list_empty(item)) {
> +	if (likely(!list_empty(item))) {
>  		l = list_lru_from_kmem(nlru, item);
>  		list_del_init(item);
>  		l->nr_items--;
> -- 
> 1.8.3.1
> 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] list_lru: Prefetch neighboring list entries before acquiring lock
  2017-11-30  0:53 ` Minchan Kim
@ 2017-11-30 13:43   ` Waiman Long
  2017-11-30 23:53     ` Minchan Kim
  0 siblings, 1 reply; 18+ messages in thread
From: Waiman Long @ 2017-11-30 13:43 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, Vladimir Davydov, Johannes Weiner, Dave Chinner,
	linux-kernel, linux-mm

On 11/29/2017 07:53 PM, Minchan Kim wrote:
> Hello,
>
> On Wed, Nov 29, 2017 at 09:17:34AM -0500, Waiman Long wrote:
>> The list_lru_del() function removes the given item from the LRU list.
>> The operation looks simple, but it involves writing into the cachelines
>> of the two neighboring list entries in order to get the deletion done.
>> That can take a while if the cachelines aren't there yet, thus
>> prolonging the lock hold time.
>>
>> To reduce the lock hold time, the cachelines of the two neighboring
>> list entries are now prefetched before acquiring the list_lru_node's
>> lock.
>>
>> Using a multi-threaded test program that created a large number
>> of dentries and then killed them, the execution time was reduced
>> from 38.5s to 36.6s after applying the patch on a 2-socket 36-core
>> 72-thread x86-64 system.
>>
>> Signed-off-by: Waiman Long <longman@redhat.com>
>> ---
>>  mm/list_lru.c | 10 +++++++++-
>>  1 file changed, 9 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/list_lru.c b/mm/list_lru.c
>> index f141f0c..65aae44 100644
>> --- a/mm/list_lru.c
>> +++ b/mm/list_lru.c
>> @@ -132,8 +132,16 @@ bool list_lru_del(struct list_lru *lru, struct list_head *item)
>>  	struct list_lru_node *nlru = &lru->node[nid];
>>  	struct list_lru_one *l;
>>  
>> +	/*
>> +	 * Prefetch the neighboring list entries to reduce lock hold time.
>> +	 */
>> +	if (unlikely(list_empty(item)))
>> +		return false;
>> +	prefetchw(item->prev);
>> +	prefetchw(item->next);
>> +
> A question:
>
> A few month ago, I had a chance to measure prefetch effect with my testing
> workload. For the clarification, it's not list_lru_del but list traverse
> stuff so it might be similar.
>
> With my experiment at that time, it was really hard to find best place to
> add prefetchw. Sometimes, it was too eariler or late so the effect was
> not good, even worse on some cases.
>
> Also, the performance was different with each machine although my testing
> machines was just two. ;-)
>
> So my question is what's a rule of thumb to add prefetch command?
> Like your code, putting prefetch right before touching?
>
> I'm really wonder what's the rule to make every arch/machines happy
> with prefetch.

I add the prefetchw() before spin_lock() because the latency of the
locking operation can be highly variable. The latency will be high
when the lock is contended. With the prefetch, the lock hold time is
reduced, which in turn helps to reduce the amount of lock contention as
well. If there is no lock contention, the prefetch won't help.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] list_lru: Prefetch neighboring list entries before acquiring lock
  2017-11-30  0:42   ` Dave Chinner
@ 2017-11-30 13:54     ` Waiman Long
  2017-11-30 20:38       ` Dave Chinner
  2017-11-30 20:47       ` Andrew Morton
  0 siblings, 2 replies; 18+ messages in thread
From: Waiman Long @ 2017-11-30 13:54 UTC (permalink / raw)
  To: Dave Chinner, Andrew Morton
  Cc: Vladimir Davydov, Johannes Weiner, linux-kernel, linux-mm

On 11/29/2017 07:42 PM, Dave Chinner wrote:
> On Wed, Nov 29, 2017 at 01:53:19PM -0800, Andrew Morton wrote:
>> On Wed, 29 Nov 2017 09:17:34 -0500 Waiman Long <longman@redhat.com> wrote:
>>
>>> The list_lru_del() function removes the given item from the LRU list.
>>> The operation looks simple, but it involves writing into the cachelines
>>> of the two neighboring list entries in order to get the deletion done.
>>> That can take a while if the cachelines aren't there yet, thus
>>> prolonging the lock hold time.
>>>
>>> To reduce the lock hold time, the cachelines of the two neighboring
>>> list entries are now prefetched before acquiring the list_lru_node's
>>> lock.
>>>
>>> Using a multi-threaded test program that created a large number
>>> of dentries and then killed them, the execution time was reduced
>>> from 38.5s to 36.6s after applying the patch on a 2-socket 36-core
>>> 72-thread x86-64 system.
>> Patch looks good.
>>
>> Can someone (Dave?) please explain why list_lru_del() supports deletion
>> of an already list_empty(item)?
>> This seems a rather dangerous thing to
>> encourage.  Use cases I can think of are:
>>
>> a) item is already reliably deleted, so why the heck was the caller
>>    calling list_lru_del() and 
> Higher level operations can race. e.g. caller looks up an object,
> finds it on the LRU, takes a reference. Then calls list_lru_del()
> to remove it from the LRU. It blocks 'cause it can't get the list
> lock as....
>
> ... Meanwhile, the list shrinker is running, sees the object on the
> LRU list, sees it has a valid reference count, does lazy LRU cleanup
> by runnning list_lru_isolate() on the object which removes it from
> the LRU list. Eventually it drops the list lock, and ....
>
> ... the original thread gets the lock in list_lru_del() and sees the
> object has already been removed from the LRU....
>
> IOWs, this sort of boilerplate code is potentially dangerous if
> list_lru_del() can't handle items that have already been removed
> from the list:
>
> 	if (!list_empty(&obj->lru))
> 		list_lru_del(&obj->lru);
>
> Because this:
>
> 	if (!list_empty(&obj->lru))
> 		<preempt>
> 		<shrinker removes obj from LRU>
> 		list_lru_del(&obj->lru);
> 			<SPLAT>
>
> Would result in bad things happening....
>
> And, from that perspective, the racy shortcut in the proposed patch
> is wrong, too. Prefetch is fine, but in general shortcutting list
> empty checks outside the internal lock isn't.

For the record, I added one more list_empty() check at the beginning of
list_lru_del() in the patch for two purposes:
1. It allows the code to bail out early.
2. It makes sure the cacheline of the list_head entry itself is loaded.

Other than that, I only added a likely() qualifier to the existing
list_empty() check within the lock critical region.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] list_lru: Prefetch neighboring list entries before acquiring lock
  2017-11-29 14:17 [PATCH] list_lru: Prefetch neighboring list entries before acquiring lock Waiman Long
  2017-11-29 21:53 ` Andrew Morton
  2017-11-30  0:53 ` Minchan Kim
@ 2017-11-30 14:34 ` Matthew Wilcox
  2017-12-05 14:49 ` Michal Hocko
  3 siblings, 0 replies; 18+ messages in thread
From: Matthew Wilcox @ 2017-11-30 14:34 UTC (permalink / raw)
  To: Waiman Long
  Cc: Andrew Morton, Vladimir Davydov, Johannes Weiner, Dave Chinner,
	linux-kernel, linux-mm

On Wed, Nov 29, 2017 at 09:17:34AM -0500, Waiman Long wrote:
> The list_lru_del() function removes the given item from the LRU list.
> The operation looks simple, but it involves writing into the cachelines
> of the two neighboring list entries in order to get the deletion done.
> That can take a while if the cachelines aren't there yet, thus
> prolonging the lock hold time.
> 
> To reduce the lock hold time, the cachelines of the two neighboring
> list entries are now prefetched before acquiring the list_lru_node's
> lock.
> 
> Using a multi-threaded test program that created a large number
> of dentries and then killed them, the execution time was reduced
> from 38.5s to 36.6s after applying the patch on a 2-socket 36-core
> 72-thread x86-64 system.

FWIW, I've been thinking about a new queue data structure based on
the xarray.  It'd avoid this cache miss problem altogether (deleting an
entry from the queue usually touches one cacheline).  I was specifically
thinking about list_lru when designing it because I'd love to reclaim the
two pointers in the radix_tree_node for use as a fourth tag bit.

There's no code behind this yet, but I wrote some documentation for it as
I was designing it in my head:


======
XQueue
======

Overview
========

The XQueue is an abstract data type which provides a low-overhead queue.
It supports up to two billion entries per queue and has a memory overhead
approximately 40% lower than the doubly-linked list approach.  It does
not require embedding any information in the object, allowing objects
to be on multiple queues simultaneously.  However, it may require memory
allocation in order to add an element to the queue.

Objects can be removed from arbitrary positions in the XQueue, but
insertion of new objects at arbitrary positions in the XQueue is not
possible.  Objects in the XQueue may be replaced with other objects.

How to use it
=============

Initialise your XQueue by defining it using :c:func:`DEFINE_XQUEUE` or
dynamically allocating it and calling :c:func:`xq_init`.

Most users can then call :c:func:`xq_push` to add a new object to the queue
and :c:func:`xq_pop` to remove the object at the front of the queue.

If you need to be able to remove an object from an arbitrary position
in the queue, embed the queue ID returned from :c:func:`xq_push` in your
object, and use :c:func:`xq_remove`.

Some users need to hold a lock of their own while pushing an object onto
the queue.  These users can either use GFP_ATOMIC in the gfp flags, or
(preferably) call :c:func:`xq_reserve` to reserve a queue ID for their
object before taking their lock, and placing the object into the queue
using :c:func:`xq_replace`.

Some users want to use the queue ID of the object for an external purpose
(eg use it as the ID of a network packet).  These users can constrain
the range of the ID allocated by calling :c:func:`xq_push_range` or
:c:func:`xq_reserve_range` instead.

If using the reserve/replace technique, the queue becomes not quite pure.
For example, the following situation can happen with two producers (B & C)
and one consumer (A):

B: Push queue ID 2
C: Reserve queue ID 3
B: Push queue ID 4
B: Push queue ID 5
A: Pop queue ID 2
A: Pop queue ID 4
C: Replace queue ID 3
A: Pop queue ID 3
A: Pop queue ID 5

C's object has left the queue between B's two objects.  This is not
normally a concern to most users, but if a producer puts a fence entry
on the queue, it is permeable to another entry which had an ID reserved
before the fence was enqueued, but not actually used until after the fence
was enqueued.

Non-queue usages
================

You may wish to use the XQueue as an ID allocator.  This is a perfectly
good use for the XQueue, and it is superior to the IDR in some ways.

For example, suppose you are allocating request IDs.  Using the IDR for a
cyclic allocation would lead to very deep trees being created and a lot
of memory being spent on internal structures.  The XQueue will compact
the trees that it uses and attempt to keep memory usage to a minimum.

In this scenario, you would almost never call xq_pop.  You'd call xq_push to
assign an ID.

Functions and structures
========================

.. kernel-doc:: include/linux/xqueue.h
.. kernel-doc:: lib/xqueue.c


I also wrote xqueue.h:


/*
 * eXtensible Queues
 * Copyright (c) 2017 Microsoft Corporation <mawilcox@microsoft.com>
 *
 * This program is free software; you can redistribute it and/or
 * modify it under the terms of the GNU General Public License as
 * published by the Free Software Foundation; either version 2 of the
 * License, or (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * GNU General Public License for more details.
 *
 * eXtensible Queues provide queues which are more cache-efficient than
 * using list.h
 */

#ifndef _LINUX_XQUEUE_H
#define _LINUX_XQUEUE_H

#include <linux/xarray.h>

struct xqueue {
	struct xarray xq_xa;
	unsigned int xq_last;
	unsigned int xq_base;
};

#define XQ_INIT_FLAGS	(XA_FLAGS_TRACK_FREE | XA_FLAGS_TAG(0))

#define XQUEUE_INIT(name) {					\
	.xq_xa = __XARRAY_INIT(name, XQ_INIT_FLAGS),		\
	.xq_last = 0,						\
	.xq_base = 0,						\
}


#define DEFINE_XQUEUE(name) struct xqueue name = XQUEUE_INIT(name)

static inline void xq_init(struct xqueue *xq)
{
	xa_init(&xq->xq_xa);
	xq->xq_last = 0;
	xq->xq_base = 0;
}

int xq_push_range(struct xqueue *, void *entry, unsigned int min,
			unsigned int max, gfp_t);
void *xq_pop(struct xqueue *);
int xq_reserve_range(struct xqueue *, unsigned int min, unsigned int max,
			gfp_t);

static inline int xq_push(struct xqueue *xq, void *entry, gfp_t gfp)
{
	return xq_push_range(xq, entry, 0, INT_MAX, gfp);
}

static inline int xq_reserve(struct xqueue *xq, gfp_t gfp)
{
	return xq_reserve_range(xq, 0, INT_MAX, gfp);
}

/**
 * xq_replace() - Replace an entry in the queue.
 * @xq: The XQueue.
 * @id: Previously allocated queue ID.
 * @entry: The new object to insert into the queue.
 *
 * Normally used as part of the reserve/replace pattern, this function
 * can be used to replace any object still in the queue.
 */
static inline void xq_replace(struct xqueue *xq, unsigned int id, void *entry)
{
	xa_store(&xq->xq_xa, id, entry, 0);
}

static inline int xq_remove(struct xqueue *xq, unsigned int id)
{
	if (xa_store(&xq->xq_xa, id, NULL, 0))
		return 0;
	return -ENOENT;
}

static inline bool xq_empty(const struct xqueue *xq)
{
	return xa_empty(&xq->xq_xa);
}

#endif /* _LINUX_XQUEUE_H */
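
A purely hypothetical caller of the above might look something like this
(only the names declared in the sketch header are used; nothing is
implemented yet):

/* Hypothetical usage of the xqueue sketch; none of this exists yet. */
static DEFINE_XQUEUE(request_queue);

static int enqueue_request(void *req)
{
	/* xq_push() returns the queue ID assigned to req, or a negative errno. */
	return xq_push(&request_queue, req, GFP_KERNEL);
}

static void *dequeue_request(void)
{
	/* The oldest entry comes off the front of the queue. */
	return xq_pop(&request_queue);
}

static void cancel_request(int id)
{
	/* Drop a specific entry using the ID that xq_push() returned. */
	xq_remove(&request_queue, id);
}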

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] list_lru: Prefetch neighboring list entries before acquiring lock
  2017-11-30 13:54     ` Waiman Long
@ 2017-11-30 20:38       ` Dave Chinner
  2017-11-30 20:55         ` Waiman Long
  2017-11-30 20:47       ` Andrew Morton
  1 sibling, 1 reply; 18+ messages in thread
From: Dave Chinner @ 2017-11-30 20:38 UTC (permalink / raw)
  To: Waiman Long
  Cc: Andrew Morton, Vladimir Davydov, Johannes Weiner, linux-kernel, linux-mm

On Thu, Nov 30, 2017 at 08:54:04AM -0500, Waiman Long wrote:
> On 11/29/2017 07:42 PM, Dave Chinner wrote:
> > On Wed, Nov 29, 2017 at 01:53:19PM -0800, Andrew Morton wrote:
> >> On Wed, 29 Nov 2017 09:17:34 -0500 Waiman Long <longman@redhat.com> wrote:
> >>
> >>> The list_lru_del() function removes the given item from the LRU list.
> >>> The operation looks simple, but it involves writing into the cachelines
> >>> of the two neighboring list entries in order to get the deletion done.
> >>> That can take a while if the cachelines aren't there yet, thus
> >>> prolonging the lock hold time.
> >>>
> >>> To reduce the lock hold time, the cachelines of the two neighboring
> >>> list entries are now prefetched before acquiring the list_lru_node's
> >>> lock.
> >>>
> >>> Using a multi-threaded test program that created a large number
> >>> of dentries and then killed them, the execution time was reduced
> >>> from 38.5s to 36.6s after applying the patch on a 2-socket 36-core
> >>> 72-thread x86-64 system.
> >> Patch looks good.
> >>
> >> Can someone (Dave?) please explain why list_lru_del() supports deletion
> >> of an already list_empty(item)?
> >> This seems a rather dangerous thing to
> >> encourage.  Use cases I can think of are:
> >>
> >> a) item is already reliably deleted, so why the heck was the caller
> >>    calling list_lru_del() and 
> > Higher level operations can race. e.g. caller looks up an object,
> > finds it on the LRU, takes a reference. Then calls list_lru_del()
> > to remove it from the LRU. It blocks 'cause it can't get the list
> > lock as....
> >
> > ... Meanwhile, the list shrinker is running, sees the object on the
> > LRU list, sees it has a valid reference count, does lazy LRU cleanup
> > by runnning list_lru_isolate() on the object which removes it from
> > the LRU list. Eventually it drops the list lock, and ....
> >
> > ... the original thread gets the lock in list_lru_del() and sees the
> > object has already been removed from the LRU....
> >
> > IOWs, this sort of boilerplate code is potentially dangerous if
> > list_lru_del() can't handle items that have already been removed
> > from the list:
> >
> > 	if (!list_empty(&obj->lru))
> > 		list_lru_del(&obj->lru);
> >
> > Because this:
> >
> > 	if (!list_empty(&obj->lru))
> > 		<preempt>
> > 		<shrinker removes obj from LRU>
> > 		list_lru_del(&obj->lru);
> > 			<SPLAT>
> >
> > Would result in bad things happening....
> >
> > And, from that perspective, the racy shortcut in the proposed patch
> > is wrong, too. Prefetch is fine, but in general shortcutting list
> > empty checks outside the internal lock isn't.
> 
> For the record, I add one more list_empty() check at the beginning of
> list_lru_del() in the patch for 2 purpose:
> 1. it allows the code to bail out early.

Which is what I said was wrong. You haven't addressed why you think
it's safe to add racy speculative checks to this code in your quest
for speed.

Also, what I'm curious about is how much of the gain is from the
prefetching, and how much of the gain is from avoiding the lock
altogether by the early bailout...

> 2. It make sure the cacheline of the list_head entry itself is loaded.
> 
> Other than that, I only add a likely() qualifier to the existing
> list_empty() check within the lock critical region.

Yup, but in many cases programmers get the static branch prediction
hints wrong. In this case, you are supposing that nobody ever
calls list_lru_del() on objects that aren't on the lru. That's not
true - inodes that are being evicted may never have been on the LRU
at all, but we still call through list_lru_del() so it can determine
the LRU state correctly (e.g. cache cold rm -rf workloads)....

IOWs, I'm pretty sure even just adding static branch prediction
hints here is wrong....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] list_lru: Prefetch neighboring list entries before acquiring lock
  2017-11-30 13:54     ` Waiman Long
  2017-11-30 20:38       ` Dave Chinner
@ 2017-11-30 20:47       ` Andrew Morton
  2017-11-30 20:49         ` Waiman Long
  2017-12-01  0:09         ` Minchan Kim
  1 sibling, 2 replies; 18+ messages in thread
From: Andrew Morton @ 2017-11-30 20:47 UTC (permalink / raw)
  To: Waiman Long
  Cc: Dave Chinner, Vladimir Davydov, Johannes Weiner, linux-kernel, linux-mm

On Thu, 30 Nov 2017 08:54:04 -0500 Waiman Long <longman@redhat.com> wrote:

> > And, from that perspective, the racy shortcut in the proposed patch
> > is wrong, too. Prefetch is fine, but in general shortcutting list
> > empty checks outside the internal lock isn't.
> 
> For the record, I add one more list_empty() check at the beginning of
> list_lru_del() in the patch for 2 purpose:
> 1. it allows the code to bail out early.
> 2. It make sure the cacheline of the list_head entry itself is loaded.
> 
> Other than that, I only add a likely() qualifier to the existing
> list_empty() check within the lock critical region.

But it sounds like Dave thinks that unlocked check should be removed?

How does this addendum look?

From: Andrew Morton <akpm@linux-foundation.org>
Subject: list_lru-prefetch-neighboring-list-entries-before-acquiring-lock-fix

include prefetch.h, remove unlocked list_empty() test, per Dave

Cc: Dave Chinner <david@fromorbit.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Waiman Long <longman@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/list_lru.c |    5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff -puN mm/list_lru.c~list_lru-prefetch-neighboring-list-entries-before-acquiring-lock-fix mm/list_lru.c
--- a/mm/list_lru.c~list_lru-prefetch-neighboring-list-entries-before-acquiring-lock-fix
+++ a/mm/list_lru.c
@@ -8,6 +8,7 @@
 #include <linux/module.h>
 #include <linux/mm.h>
 #include <linux/list_lru.h>
+#include <linux/prefetch.h>
 #include <linux/slab.h>
 #include <linux/mutex.h>
 #include <linux/memcontrol.h>
@@ -135,13 +136,11 @@ bool list_lru_del(struct list_lru *lru,
 	/*
 	 * Prefetch the neighboring list entries to reduce lock hold time.
 	 */
-	if (unlikely(list_empty(item)))
-		return false;
 	prefetchw(item->prev);
 	prefetchw(item->next);
 
 	spin_lock(&nlru->lock);
-	if (likely(!list_empty(item))) {
+	if (!list_empty(item)) {
 		l = list_lru_from_kmem(nlru, item);
 		list_del_init(item);
 		l->nr_items--;
_

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] list_lru: Prefetch neighboring list entries before acquiring lock
  2017-11-30 20:47       ` Andrew Morton
@ 2017-11-30 20:49         ` Waiman Long
  2017-12-01  0:09         ` Minchan Kim
  1 sibling, 0 replies; 18+ messages in thread
From: Waiman Long @ 2017-11-30 20:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Dave Chinner, Vladimir Davydov, Johannes Weiner, linux-kernel, linux-mm

On 11/30/2017 03:47 PM, Andrew Morton wrote:
> On Thu, 30 Nov 2017 08:54:04 -0500 Waiman Long <longman@redhat.com> wrote:
>
>>> And, from that perspective, the racy shortcut in the proposed patch
>>> is wrong, too. Prefetch is fine, but in general shortcutting list
>>> empty checks outside the internal lock isn't.
>> For the record, I add one more list_empty() check at the beginning of
>> list_lru_del() in the patch for 2 purpose:
>> 1. it allows the code to bail out early.
>> 2. It make sure the cacheline of the list_head entry itself is loaded.
>>
>> Other than that, I only add a likely() qualifier to the existing
>> list_empty() check within the lock critical region.
> But it sounds like Dave thinks that unlocked check should be removed?
>
> How does this adendum look?
>
> From: Andrew Morton <akpm@linux-foundation.org>
> Subject: list_lru-prefetch-neighboring-list-entries-before-acquiring-lock-fix
>
> include prefetch.h, remove unlocked list_empty() test, per Dave
>
> Cc: Dave Chinner <david@fromorbit.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> Cc: Waiman Long <longman@redhat.com>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> ---
>
>  mm/list_lru.c |    5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
>
> diff -puN mm/list_lru.c~list_lru-prefetch-neighboring-list-entries-before-acquiring-lock-fix mm/list_lru.c
> --- a/mm/list_lru.c~list_lru-prefetch-neighboring-list-entries-before-acquiring-lock-fix
> +++ a/mm/list_lru.c
> @@ -8,6 +8,7 @@
>  #include <linux/module.h>
>  #include <linux/mm.h>
>  #include <linux/list_lru.h>
> +#include <linux/prefetch.h>
>  #include <linux/slab.h>
>  #include <linux/mutex.h>
>  #include <linux/memcontrol.h>
> @@ -135,13 +136,11 @@ bool list_lru_del(struct list_lru *lru,
>  	/*
>  	 * Prefetch the neighboring list entries to reduce lock hold time.
>  	 */
> -	if (unlikely(list_empty(item)))
> -		return false;
>  	prefetchw(item->prev);
>  	prefetchw(item->next);
>  
>  	spin_lock(&nlru->lock);
> -	if (likely(!list_empty(item))) {
> +	if (!list_empty(item)) {
>  		l = list_lru_from_kmem(nlru, item);
>  		list_del_init(item);
>  		l->nr_items--;
> _
>
Yes, that looks good to me.

Thanks for fixing that.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] list_lru: Prefetch neighboring list entries before acquiring lock
  2017-11-30 20:38       ` Dave Chinner
@ 2017-11-30 20:55         ` Waiman Long
  0 siblings, 0 replies; 18+ messages in thread
From: Waiman Long @ 2017-11-30 20:55 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Andrew Morton, Vladimir Davydov, Johannes Weiner, linux-kernel, linux-mm

On 11/30/2017 03:38 PM, Dave Chinner wrote:
> On Thu, Nov 30, 2017 at 08:54:04AM -0500, Waiman Long wrote:
>>
>> For the record, I add one more list_empty() check at the beginning of
>> list_lru_del() in the patch for 2 purpose:
>> 1. it allows the code to bail out early.
> Which is what I said was wrong. You haven't addressed why you think
> it's safe to add racy specualtive checks to this code in your quest
> for speed.
>
> Also, I'm curious about is how much of the gain is from the
> prefetching, and how much of the gain is from avoiding the lock
> altogether by the early bailout...

The early bailout doesn't improve the test at all. In the case of
dentries, there is a flag that indicates whether the dentry is on the LRU
list, so list_lru_del() is only called when it actually is on the list.
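
For reference, the dcache helper involved looks roughly like this (a
simplified sketch, with the unused-dentry accounting left out):

	static void d_lru_del(struct dentry *dentry)
	{
		/* Only dentries flagged as being on the LRU ever get here. */
		dentry->d_flags &= ~DCACHE_LRU_LIST;
		list_lru_del(&dentry->d_sb->s_dentry_lru, &dentry->d_lru);
	}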

>> 2. It make sure the cacheline of the list_head entry itself is loaded.
>>
>> Other than that, I only add a likely() qualifier to the existing
>> list_empty() check within the lock critical region.
> Yup, but in many cases programmers get the static branch prediction
> hints are wrong. In this case, you are supposing that nobody ever
> calls list_lru_del() on objects that aren't on the lru. That's not
> true - inodes that are being evicted may never have been on the LRU
> at all, but we still call through list_lru_del() so it can determine
> the LRU state correctly (e.g. cache cold rm -rf workloads)....
>
> IOWs, I'm pretty sure even just adding static branch prediction
> hints here is wrong....

In the case of dentries, the static branch hint is right. However, it may
not be true for other users of list_lru, so I am OK with taking the hints
out. Thanks for the explanation.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] list_lru: Prefetch neighboring list entries before acquiring lock
  2017-11-30 13:43   ` Waiman Long
@ 2017-11-30 23:53     ` Minchan Kim
  0 siblings, 0 replies; 18+ messages in thread
From: Minchan Kim @ 2017-11-30 23:53 UTC (permalink / raw)
  To: Waiman Long
  Cc: Andrew Morton, Vladimir Davydov, Johannes Weiner, Dave Chinner,
	linux-kernel, linux-mm

On Thu, Nov 30, 2017 at 08:43:41AM -0500, Waiman Long wrote:
> On 11/29/2017 07:53 PM, Minchan Kim wrote:
> > Hello,
> >
> > On Wed, Nov 29, 2017 at 09:17:34AM -0500, Waiman Long wrote:
> >> The list_lru_del() function removes the given item from the LRU list.
> >> The operation looks simple, but it involves writing into the cachelines
> >> of the two neighboring list entries in order to get the deletion done.
> >> That can take a while if the cachelines aren't there yet, thus
> >> prolonging the lock hold time.
> >>
> >> To reduce the lock hold time, the cachelines of the two neighboring
> >> list entries are now prefetched before acquiring the list_lru_node's
> >> lock.
> >>
> >> Using a multi-threaded test program that created a large number
> >> of dentries and then killed them, the execution time was reduced
> >> from 38.5s to 36.6s after applying the patch on a 2-socket 36-core
> >> 72-thread x86-64 system.
> >>
> >> Signed-off-by: Waiman Long <longman@redhat.com>
> >> ---
> >>  mm/list_lru.c | 10 +++++++++-
> >>  1 file changed, 9 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/mm/list_lru.c b/mm/list_lru.c
> >> index f141f0c..65aae44 100644
> >> --- a/mm/list_lru.c
> >> +++ b/mm/list_lru.c
> >> @@ -132,8 +132,16 @@ bool list_lru_del(struct list_lru *lru, struct list_head *item)
> >>  	struct list_lru_node *nlru = &lru->node[nid];
> >>  	struct list_lru_one *l;
> >>  
> >> +	/*
> >> +	 * Prefetch the neighboring list entries to reduce lock hold time.
> >> +	 */
> >> +	if (unlikely(list_empty(item)))
> >> +		return false;
> >> +	prefetchw(item->prev);
> >> +	prefetchw(item->next);
> >> +
> > A question:
> >
> > A few month ago, I had a chance to measure prefetch effect with my testing
> > workload. For the clarification, it's not list_lru_del but list traverse
> > stuff so it might be similar.
> >
> > With my experiment at that time, it was really hard to find best place to
> > add prefetchw. Sometimes, it was too eariler or late so the effect was
> > not good, even worse on some cases.
> >
> > Also, the performance was different with each machine although my testing
> > machines was just two. ;-)
> >
> > So my question is what's a rule of thumb to add prefetch command?
> > Like your code, putting prefetch right before touching?
> >
> > I'm really wonder what's the rule to make every arch/machines happy
> > with prefetch.
> 
> I add the prefetchw() before spin_lock() because the latency of the
> lockinig operation can be highly variable. There will have high latency
> when the lock is contended. With the prefetch, lock hold time will be
> reduced. In turn, it helps to reduce the amount of lock contention as
> well. If there is no lock contention, the prefetch won't help.

I understood that from your description. My point is that a prefetch
optimization can show different results on different architectures and
workloads, so I wanted to know what kind of rule we have to prove it's
always a win, or at least not harmful, for *every case* in general.

This is a performance patch on a very micro-optimized topic, so I think
we need more data to prove it. Maybe perf is our best friend here, and
we need an experiment with a no-lock-contention case, at least.

Thanks.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] list_lru: Prefetch neighboring list entries before acquiring lock
  2017-11-30 20:47       ` Andrew Morton
  2017-11-30 20:49         ` Waiman Long
@ 2017-12-01  0:09         ` Minchan Kim
  2017-12-01 14:14           ` Waiman Long
  1 sibling, 1 reply; 18+ messages in thread
From: Minchan Kim @ 2017-12-01  0:09 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Waiman Long, Dave Chinner, Vladimir Davydov, Johannes Weiner,
	linux-kernel, linux-mm

On Thu, Nov 30, 2017 at 12:47:36PM -0800, Andrew Morton wrote:
> On Thu, 30 Nov 2017 08:54:04 -0500 Waiman Long <longman@redhat.com> wrote:
> 
> > > And, from that perspective, the racy shortcut in the proposed patch
> > > is wrong, too. Prefetch is fine, but in general shortcutting list
> > > empty checks outside the internal lock isn't.
> > 
> > For the record, I add one more list_empty() check at the beginning of
> > list_lru_del() in the patch for 2 purpose:
> > 1. it allows the code to bail out early.
> > 2. It make sure the cacheline of the list_head entry itself is loaded.
> > 
> > Other than that, I only add a likely() qualifier to the existing
> > list_empty() check within the lock critical region.
> 
> But it sounds like Dave thinks that unlocked check should be removed?
> 
> How does this adendum look?
> 
> From: Andrew Morton <akpm@linux-foundation.org>
> Subject: list_lru-prefetch-neighboring-list-entries-before-acquiring-lock-fix
> 
> include prefetch.h, remove unlocked list_empty() test, per Dave
> 
> Cc: Dave Chinner <david@fromorbit.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> Cc: Waiman Long <longman@redhat.com>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> ---
> 
>  mm/list_lru.c |    5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
> 
> diff -puN mm/list_lru.c~list_lru-prefetch-neighboring-list-entries-before-acquiring-lock-fix mm/list_lru.c
> --- a/mm/list_lru.c~list_lru-prefetch-neighboring-list-entries-before-acquiring-lock-fix
> +++ a/mm/list_lru.c
> @@ -8,6 +8,7 @@
>  #include <linux/module.h>
>  #include <linux/mm.h>
>  #include <linux/list_lru.h>
> +#include <linux/prefetch.h>
>  #include <linux/slab.h>
>  #include <linux/mutex.h>
>  #include <linux/memcontrol.h>
> @@ -135,13 +136,11 @@ bool list_lru_del(struct list_lru *lru,
>  	/*
>  	 * Prefetch the neighboring list entries to reduce lock hold time.
>  	 */
> -	if (unlikely(list_empty(item)))
> -		return false;
>  	prefetchw(item->prev);
>  	prefetchw(item->next);
>  
>  	spin_lock(&nlru->lock);
> -	if (likely(!list_empty(item))) {
> +	if (!list_empty(item)) {
>  		l = list_lru_from_kmem(nlru, item);
>  		list_del_init(item);
>  		l->nr_items--;

If we cannot guarantee that !list_empty is likely, a prefetch with a NULL
pointer could be harmful, by the lesson we have learned:

        https://lwn.net/Articles/444336/

Also, considering that list_lru_del() is a generic library function, it
cannot see whether a workload creates heavy lock contention or not.
Maybe the right place for prefetching would be the caller, not the
library itself.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] list_lru: Prefetch neighboring list entries before acquiring lock
  2017-12-01  0:09         ` Minchan Kim
@ 2017-12-01 14:14           ` Waiman Long
  2017-12-01 22:02             ` Dave Chinner
  0 siblings, 1 reply; 18+ messages in thread
From: Waiman Long @ 2017-12-01 14:14 UTC (permalink / raw)
  To: Minchan Kim, Andrew Morton
  Cc: Dave Chinner, Vladimir Davydov, Johannes Weiner, linux-kernel, linux-mm

On 11/30/2017 07:09 PM, Minchan Kim wrote:
> On Thu, Nov 30, 2017 at 12:47:36PM -0800, Andrew Morton wrote:
>> On Thu, 30 Nov 2017 08:54:04 -0500 Waiman Long <longman@redhat.com> wrote:
>>
>>>> And, from that perspective, the racy shortcut in the proposed patch
>>>> is wrong, too. Prefetch is fine, but in general shortcutting list
>>>> empty checks outside the internal lock isn't.
>>> For the record, I add one more list_empty() check at the beginning of
>>> list_lru_del() in the patch for 2 purpose:
>>> 1. it allows the code to bail out early.
>>> 2. It make sure the cacheline of the list_head entry itself is loaded.
>>>
>>> Other than that, I only add a likely() qualifier to the existing
>>> list_empty() check within the lock critical region.
>> But it sounds like Dave thinks that unlocked check should be removed?
>>
>> How does this adendum look?
>>
>> From: Andrew Morton <akpm@linux-foundation.org>
>> Subject: list_lru-prefetch-neighboring-list-entries-before-acquiring-lock-fix
>>
>> include prefetch.h, remove unlocked list_empty() test, per Dave
>>
>> Cc: Dave Chinner <david@fromorbit.com>
>> Cc: Johannes Weiner <hannes@cmpxchg.org>
>> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
>> Cc: Waiman Long <longman@redhat.com>
>> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
>> ---
>>
>>  mm/list_lru.c |    5 ++---
>>  1 file changed, 2 insertions(+), 3 deletions(-)
>>
>> diff -puN mm/list_lru.c~list_lru-prefetch-neighboring-list-entries-before-acquiring-lock-fix mm/list_lru.c
>> --- a/mm/list_lru.c~list_lru-prefetch-neighboring-list-entries-before-acquiring-lock-fix
>> +++ a/mm/list_lru.c
>> @@ -8,6 +8,7 @@
>>  #include <linux/module.h>
>>  #include <linux/mm.h>
>>  #include <linux/list_lru.h>
>> +#include <linux/prefetch.h>
>>  #include <linux/slab.h>
>>  #include <linux/mutex.h>
>>  #include <linux/memcontrol.h>
>> @@ -135,13 +136,11 @@ bool list_lru_del(struct list_lru *lru,
>>  	/*
>>  	 * Prefetch the neighboring list entries to reduce lock hold time.
>>  	 */
>> -	if (unlikely(list_empty(item)))
>> -		return false;
>>  	prefetchw(item->prev);
>>  	prefetchw(item->next);
>>  
>>  	spin_lock(&nlru->lock);
>> -	if (likely(!list_empty(item))) {
>> +	if (!list_empty(item)) {
>>  		l = list_lru_from_kmem(nlru, item);
>>  		list_del_init(item);
>>  		l->nr_items--;
> If we cannot guarantee it's likely !list_empty, prefetch with NULL pointer
> would be harmful by the lesson we have learned.
>
>         https://lwn.net/Articles/444336/

FYI, when list_empty() is true, it just means the links are pointing to
the list entry itself. The pointers will never be NULL, so that won't
cause the NULL prefetch problem mentioned in the article.
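
Paraphrasing include/linux/list.h, an "empty" entry is one whose links
point back at itself, and list_del_init() restores exactly that state:

	static inline int list_empty(const struct list_head *head)
	{
		return READ_ONCE(head->next) == head;
	}

	static inline void list_del_init(struct list_head *entry)
	{
		__list_del_entry(entry);
		INIT_LIST_HEAD(entry);	/* ->prev and ->next point at entry */
	}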

> So, with considering list_lru_del is generic library, it cannot see
> whether a workload makes heavy lock contentions or not.
> Maybe, right place for prefetching would be in caller, not in library
> itself.

Yes, the prefetch operations will add some overhead to the whole
deletion operation when the lock isn't contended, but that is usually
rather small compared with the atomic ops involved in the locking
operation itself. On the other hand, the performance gain will be
noticeable when the lock is contended. I will run some performance
measurements and report the results later.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] list_lru: Prefetch neighboring list entries before acquiring lock
  2017-12-01 14:14           ` Waiman Long
@ 2017-12-01 22:02             ` Dave Chinner
  0 siblings, 0 replies; 18+ messages in thread
From: Dave Chinner @ 2017-12-01 22:02 UTC (permalink / raw)
  To: Waiman Long
  Cc: Minchan Kim, Andrew Morton, Vladimir Davydov, Johannes Weiner,
	linux-kernel, linux-mm

On Fri, Dec 01, 2017 at 09:14:52AM -0500, Waiman Long wrote:
> On 11/30/2017 07:09 PM, Minchan Kim wrote:
> > On Thu, Nov 30, 2017 at 12:47:36PM -0800, Andrew Morton wrote:
> >> On Thu, 30 Nov 2017 08:54:04 -0500 Waiman Long <longman@redhat.com> wrote:
> >>
> >>>> And, from that perspective, the racy shortcut in the proposed patch
> >>>> is wrong, too. Prefetch is fine, but in general shortcutting list
> >>>> empty checks outside the internal lock isn't.
> >>> For the record, I add one more list_empty() check at the beginning of
> >>> list_lru_del() in the patch for 2 purpose:
> >>> 1. it allows the code to bail out early.
> >>> 2. It make sure the cacheline of the list_head entry itself is loaded.
> >>>
> >>> Other than that, I only add a likely() qualifier to the existing
> >>> list_empty() check within the lock critical region.
> >> But it sounds like Dave thinks that unlocked check should be removed?
> >>
> >> How does this adendum look?
> >>
> >> From: Andrew Morton <akpm@linux-foundation.org>
> >> Subject: list_lru-prefetch-neighboring-list-entries-before-acquiring-lock-fix
> >>
> >> include prefetch.h, remove unlocked list_empty() test, per Dave
> >>
> >> Cc: Dave Chinner <david@fromorbit.com>
> >> Cc: Johannes Weiner <hannes@cmpxchg.org>
> >> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> >> Cc: Waiman Long <longman@redhat.com>
> >> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> >> ---
> >>
> >>  mm/list_lru.c |    5 ++---
> >>  1 file changed, 2 insertions(+), 3 deletions(-)
> >>
> >> diff -puN mm/list_lru.c~list_lru-prefetch-neighboring-list-entries-before-acquiring-lock-fix mm/list_lru.c
> >> --- a/mm/list_lru.c~list_lru-prefetch-neighboring-list-entries-before-acquiring-lock-fix
> >> +++ a/mm/list_lru.c
> >> @@ -8,6 +8,7 @@
> >>  #include <linux/module.h>
> >>  #include <linux/mm.h>
> >>  #include <linux/list_lru.h>
> >> +#include <linux/prefetch.h>
> >>  #include <linux/slab.h>
> >>  #include <linux/mutex.h>
> >>  #include <linux/memcontrol.h>
> >> @@ -135,13 +136,11 @@ bool list_lru_del(struct list_lru *lru,
> >>  	/*
> >>  	 * Prefetch the neighboring list entries to reduce lock hold time.
> >>  	 */
> >> -	if (unlikely(list_empty(item)))
> >> -		return false;
> >>  	prefetchw(item->prev);
> >>  	prefetchw(item->next);
> >>  
> >>  	spin_lock(&nlru->lock);
> >> -	if (likely(!list_empty(item))) {
> >> +	if (!list_empty(item)) {
> >>  		l = list_lru_from_kmem(nlru, item);
> >>  		list_del_init(item);
> >>  		l->nr_items--;
> > If we cannot guarantee it's likely !list_empty, prefetch with NULL pointer
> > would be harmful by the lesson we have learned.
> >
> >         https://lwn.net/Articles/444336/
> 
> FYI, when list_empty() is true, it just mean the links are pointing to
> list entry itself. The pointers will never be NULL. So that won't cause
> the NULL prefetch problem mentioned in the article.

Sure, but that misses the larger point of the article: there
are many unpredictable side effects of adding prefetches, not the least
of which is that the result is CPU specific. Some CPUs improve,
others regress, and there's no predicting which side of the ledger
any given CPU falls on.

So from that perspective, we should consider manual prefetching
harmful, similar to the way that likely/unlikely is generally
considered harmful. i.e. it's pretty much impossible for a
programmer to get right in all situations.

> 
> > So, with considering list_lru_del is generic library, it cannot see
> > whether a workload makes heavy lock contentions or not.
> > Maybe, right place for prefetching would be in caller, not in library
> > itself.
> 
> Yes, the prefetch operations will add some overhead to the whole
> deletion operation when the lock isn't contended, but that is usually
> rather small compared with the atomic ops involved in the locking
> operation itself. On the other hand, the performance gain will be
> noticeable when the lock is contended. I will ran some performance
> measurement and report the results later.

Given the extreme fuzziness of the benefit of manual prefetching,
you'll need to:
	a) document the test
	b) run the test on multiple architectures
	c) run the test on multiple different CPU models within
	   the x86-64 architecture
	d) show that it works in general for a majority of the
	   platforms and CPUs you benched, and not just for your
	   microbenchmark.

Basically, if there's historic context with people like Ingo saying
"prefetching is toxic" then there's a bloody high bar you need to
get over to add manual prefetching anywhere...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] list_lru: Prefetch neighboring list entries before acquiring lock
  2017-11-29 14:17 [PATCH] list_lru: Prefetch neighboring list entries before acquiring lock Waiman Long
                   ` (2 preceding siblings ...)
  2017-11-30 14:34 ` Matthew Wilcox
@ 2017-12-05 14:49 ` Michal Hocko
  2017-12-05 23:56   ` Andrew Morton
  3 siblings, 1 reply; 18+ messages in thread
From: Michal Hocko @ 2017-12-05 14:49 UTC (permalink / raw)
  To: Andrew Morton, Waiman Long
  Cc: Vladimir Davydov, Johannes Weiner, Dave Chinner, linux-kernel, linux-mm

On Wed 29-11-17 09:17:34, Waiman Long wrote:
> The list_lru_del() function removes the given item from the LRU list.
> The operation looks simple, but it involves writing into the cachelines
> of the two neighboring list entries in order to get the deletion done.
> That can take a while if the cachelines aren't there yet, thus
> prolonging the lock hold time.
> 
> To reduce the lock hold time, the cachelines of the two neighboring
> list entries are now prefetched before acquiring the list_lru_node's
> lock.
> 
> Using a multi-threaded test program that created a large number
> of dentries and then killed them, the execution time was reduced
> from 38.5s to 36.6s after applying the patch on a 2-socket 36-core
> 72-thread x86-64 system.
> 
> Signed-off-by: Waiman Long <longman@redhat.com>

The patch still seems to be in the mmotm tree while it breaks
compilation. At least m32r defconfig complains with
mm/list_lru.c: In function 'list_lru_del':
mm/list_lru.c:141:2: error: implicit declaration of function 'prefetchw' [-Werror=implicit-function-declaration]
  prefetchw(item->prev);

It also seems that there is no general agreement on the patch. Andrew,
do you plan to keep it?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] list_lru: Prefetch neighboring list entries before acquiring lock
  2017-12-05 14:49 ` Michal Hocko
@ 2017-12-05 23:56   ` Andrew Morton
  2017-12-06  8:07     ` Michal Hocko
  0 siblings, 1 reply; 18+ messages in thread
From: Andrew Morton @ 2017-12-05 23:56 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Waiman Long, Vladimir Davydov, Johannes Weiner, Dave Chinner,
	linux-kernel, linux-mm

On Tue, 5 Dec 2017 15:49:48 +0100 Michal Hocko <mhocko@kernel.org> wrote:

> On Wed 29-11-17 09:17:34, Waiman Long wrote:
> > The list_lru_del() function removes the given item from the LRU list.
> > The operation looks simple, but it involves writing into the cachelines
> > of the two neighboring list entries in order to get the deletion done.
> > That can take a while if the cachelines aren't there yet, thus
> > prolonging the lock hold time.
> > 
> > To reduce the lock hold time, the cachelines of the two neighboring
> > list entries are now prefetched before acquiring the list_lru_node's
> > lock.
> > 
> > Using a multi-threaded test program that created a large number
> > of dentries and then killed them, the execution time was reduced
> > from 38.5s to 36.6s after applying the patch on a 2-socket 36-core
> > 72-thread x86-64 system.
> > 
> > Signed-off-by: Waiman Long <longman@redhat.com>
> 
> The patch still seems to be in the mmotm tree while it breaks
> compilation. At least m32r defconfig complains with
> mm/list_lru.c: In function 'list_lru_del':
> mm/list_lru.c:141:2: error: implicit declaration of function 'prefetchw' [-Werror=implicit-function-declaration]
>   prefetchw(item->prev);

erp, I forgot to cc Stephen.

> It also seems that there is no general agreement in the patch. Andrew,
> do you plan to keep it?

It's in wait-and-see mode.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] list_lru: Prefetch neighboring list entries before acquiring lock
  2017-12-05 23:56   ` Andrew Morton
@ 2017-12-06  8:07     ` Michal Hocko
  0 siblings, 0 replies; 18+ messages in thread
From: Michal Hocko @ 2017-12-06  8:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Waiman Long, Vladimir Davydov, Johannes Weiner, Dave Chinner,
	linux-kernel, linux-mm

On Tue 05-12-17 15:56:18, Andrew Morton wrote:
> On Tue, 5 Dec 2017 15:49:48 +0100 Michal Hocko <mhocko@kernel.org> wrote:
[...]
> > It also seems that there is no general agreement in the patch. Andrew,
> > do you plan to keep it?
> 
> It's in wait-and-see mode.

OK, I will remove m32r from my compile test battery.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2017-12-06  8:07 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-11-29 14:17 [PATCH] list_lru: Prefetch neighboring list entries before acquiring lock Waiman Long
2017-11-29 21:53 ` Andrew Morton
2017-11-30  0:42   ` Dave Chinner
2017-11-30 13:54     ` Waiman Long
2017-11-30 20:38       ` Dave Chinner
2017-11-30 20:55         ` Waiman Long
2017-11-30 20:47       ` Andrew Morton
2017-11-30 20:49         ` Waiman Long
2017-12-01  0:09         ` Minchan Kim
2017-12-01 14:14           ` Waiman Long
2017-12-01 22:02             ` Dave Chinner
2017-11-30  0:53 ` Minchan Kim
2017-11-30 13:43   ` Waiman Long
2017-11-30 23:53     ` Minchan Kim
2017-11-30 14:34 ` Matthew Wilcox
2017-12-05 14:49 ` Michal Hocko
2017-12-05 23:56   ` Andrew Morton
2017-12-06  8:07     ` Michal Hocko
