linux-kernel.vger.kernel.org archive mirror
* [PATCH] gpu/drm/ttm: Use mutex_lock_killable() for shrinker functions.
@ 2014-05-19 14:39 Tetsuo Handa
  2014-05-20  0:40 ` Dave Airlie
  0 siblings, 1 reply; 21+ messages in thread
From: Tetsuo Handa @ 2014-05-19 14:39 UTC (permalink / raw)
  To: dchinner, glommer, mgorman, airlied; +Cc: linux-kernel

From e314a1a1583e585d062dfc30c8aad8bf5380510b Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Mon, 19 May 2014 18:43:21 +0900
Subject: [PATCH] gpu/drm/ttm: Use mutex_lock_killable() for shrinker functions.

I can observe that a RHEL7 environment stalls with 100% CPU usage when a
certain type of memory pressure is applied. While the shrinker functions
are being called by shrink_slab() before the OOM killer is triggered, the
stall lasts for many minutes.

I added a debug printk() and observed that many threads are blocked for more
than 10 seconds in the ttm_dma_pool_shrink_count()/ttm_dma_pool_shrink_scan()
functions. Since kswapd can call these functions later, the current thread
can return from these functions as soon as it is chosen by the OOM killer.

This patch changes "mutex_lock();" to "if (mutex_lock_killable()) return ...;"
so that any thread can promptly give up. (By the way, as far as I tested,
changing to "if (!mutex_trylock()) return ...;" likely shortens the duration
of the stall. Maybe we don't need to wait for the mutex if someone is already
calling these functions.)

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 drivers/gpu/drm/ttm/ttm_page_alloc_dma.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c b/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c
index 1b79bf0..f75dab8 100644
--- a/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c
+++ b/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c
@@ -1012,7 +1012,8 @@ ttm_dma_pool_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
 	if (list_empty(&_manager->pools))
 		return SHRINK_STOP;
 
-	mutex_lock(&_manager->lock);
+	if (mutex_lock_killable(&_manager->lock))
+		return SHRINK_STOP;
 	pool_offset = pool_offset % _manager->npools;
 	list_for_each_entry(p, &_manager->pools, pools) {
 		unsigned nr_free;
@@ -1043,7 +1044,8 @@ ttm_dma_pool_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
 	struct device_pools *p;
 	unsigned long count = 0;
 
-	mutex_lock(&_manager->lock);
+	if (mutex_lock_killable(&_manager->lock))
+		return 0;
 	list_for_each_entry(p, &_manager->pools, pools)
 		count += p->pool->npages_free;
 	mutex_unlock(&_manager->lock);
-- 
1.8.3.1


* Re: [PATCH] gpu/drm/ttm: Use mutex_lock_killable() for shrinker functions.
  2014-05-19 14:39 [PATCH] gpu/drm/ttm: Use mutex_lock_killable() for shrinker functions Tetsuo Handa
@ 2014-05-20  0:40 ` Dave Airlie
  2014-05-20 15:30   ` Tetsuo Handa
  0 siblings, 1 reply; 21+ messages in thread
From: Dave Airlie @ 2014-05-20  0:40 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: dchinner, glommer, mgorman, linux-kernel, DRI mailing list


cc'ing dri-devel.

> From e314a1a1583e585d062dfc30c8aad8bf5380510b Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Date: Mon, 19 May 2014 18:43:21 +0900
> Subject: [PATCH] gpu/drm/ttm: Use mutex_lock_killable() for shrinker functions.
> 
> I can observe that a RHEL7 environment stalls with 100% CPU usage when a
> certain type of memory pressure is applied. While the shrinker functions
> are being called by shrink_slab() before the OOM killer is triggered, the
> stall lasts for many minutes.
> 
> I added a debug printk() and observed that many threads are blocked for more
> than 10 seconds in the ttm_dma_pool_shrink_count()/ttm_dma_pool_shrink_scan()
> functions. Since kswapd can call these functions later, the current thread
> can return from these functions as soon as it is chosen by the OOM killer.
> 
> This patch changes "mutex_lock();" to "if (mutex_lock_killable()) return ...;"
> so that any thread can promptly give up. (By the way, as far as I tested,
> changing to "if (!mutex_trylock()) return ...;" likely shortens the duration
> of the stall. Maybe we don't need to wait for the mutex if someone is already
> calling these functions.)
> 
> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> ---
>  drivers/gpu/drm/ttm/ttm_page_alloc_dma.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c b/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c
> index 1b79bf0..f75dab8 100644
> --- a/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c
> +++ b/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c
> @@ -1012,7 +1012,8 @@ ttm_dma_pool_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
>  	if (list_empty(&_manager->pools))
>  		return SHRINK_STOP;
>  
> -	mutex_lock(&_manager->lock);
> +	if (mutex_lock_killable(&_manager->lock))
> +		return SHRINK_STOP;
>  	pool_offset = pool_offset % _manager->npools;
>  	list_for_each_entry(p, &_manager->pools, pools) {
>  		unsigned nr_free;
> @@ -1043,7 +1044,8 @@ ttm_dma_pool_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
>  	struct device_pools *p;
>  	unsigned long count = 0;
>  
> -	mutex_lock(&_manager->lock);
> +	if (mutex_lock_killable(&_manager->lock))
> +		return 0;
>  	list_for_each_entry(p, &_manager->pools, pools)
>  		count += p->pool->npages_free;
>  	mutex_unlock(&_manager->lock);
> 


* Re: [PATCH] gpu/drm/ttm: Use mutex_lock_killable() for shrinker functions.
  2014-05-20  0:40 ` Dave Airlie
@ 2014-05-20 15:30   ` Tetsuo Handa
  2014-05-24 14:22     ` Tetsuo Handa
  0 siblings, 1 reply; 21+ messages in thread
From: Tetsuo Handa @ 2014-05-20 15:30 UTC (permalink / raw)
  To: dchinner; +Cc: airlied, glommer, mgorman, linux-kernel, dri-devel

Tetsuo Handa wrote:
> From e314a1a1583e585d062dfc30c8aad8bf5380510b Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Date: Mon, 19 May 2014 18:43:21 +0900
> Subject: [PATCH] gpu/drm/ttm: Use mutex_lock_killable() for shrinker functions.
> 
> I can observe that a RHEL7 environment stalls with 100% CPU usage when a
> certain type of memory pressure is applied. While the shrinker functions
> are being called by shrink_slab() before the OOM killer is triggered, the
> stall lasts for many minutes.
> 
> I added a debug printk() and observed that many threads are blocked for more
> than 10 seconds in the ttm_dma_pool_shrink_count()/ttm_dma_pool_shrink_scan()
> functions. Since kswapd can call these functions later, the current thread
> can return from these functions as soon as it is chosen by the OOM killer.
> 
> This patch changes "mutex_lock();" to "if (mutex_lock_killable()) return ...;"
> so that any thread can promptly give up. (By the way, as far as I tested,
> changing to "if (!mutex_trylock()) return ...;" likely shortens the duration
> of the stall. Maybe we don't need to wait for the mutex if someone is already
> calling these functions.)
> 

While discussing an XFS problem, a question came to mind. Is it OK (from
the point of view of reentrancy) to use mutex_lock() or
mutex_lock_killable() inside a shrinker's entry point functions? Is the
scenario shown below possible?

(1) kswapd is doing memory reclaim which does not need to hold the mutex.

(2) Someone in GFP_KERNEL context (not kswapd) calls
    ttm_dma_pool_shrink_count() and then calls ttm_dma_pool_shrink_scan()
    from the direct reclaim path.

(3) Inside ttm_dma_pool_shrink_scan(), a GFP_KERNEL allocation is issued
    while the mutex is held by that thread.

(4) The GFP_KERNEL allocation cannot be completed immediately due to
    memory pressure.

(5) kswapd calls ttm_dma_pool_shrink_count(), which needs to hold the
    mutex.

(6) Inside ttm_dma_pool_shrink_count(), kswapd is blocked waiting for the
    mutex held by that thread, and that thread is waiting for the
    GFP_KERNEL allocation to complete, but the GFP_KERNEL allocation
    cannot be completed until the mutex held by that thread is released?
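
A minimal sketch of that pattern (hypothetical names; not the actual TTM
code):

---------- sketch.c start ----------
/* Hypothetical illustration of steps (2)-(6); not the actual TTM code. */
static DEFINE_MUTEX(pool_lock);

static unsigned long sketch_shrink_scan(struct shrinker *shrinker,
                                        struct shrink_control *sc)
{
        void *buf;

        mutex_lock(&pool_lock);         /* step (3): lock taken ... */
        /*
         * ... and a GFP_KERNEL (__GFP_WAIT) allocation issued while it
         * is held.  If the allocation enters direct reclaim (step (4)),
         * any other reclaimer such as kswapd (step (5)) that calls back
         * into this shrinker blocks on pool_lock forever (step (6)).
         */
        buf = kmalloc(PAGE_SIZE, GFP_KERNEL);
        kfree(buf);
        mutex_unlock(&pool_lock);
        return 0;
}
---------- sketch.c end ----------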


* Re: [PATCH] gpu/drm/ttm: Use mutex_lock_killable() for shrinker functions.
  2014-05-20 15:30   ` Tetsuo Handa
@ 2014-05-24 14:22     ` Tetsuo Handa
  2014-05-28 18:54       ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 21+ messages in thread
From: Tetsuo Handa @ 2014-05-24 14:22 UTC (permalink / raw)
  To: dchinner, konrad.wilk
  Cc: airlied, glommer, mgorman, linux-mm, linux-kernel, dri-devel

Hello.

I tried to test whether it is OK (from the point of view of reentrancy) to
use mutex_lock() or mutex_lock_killable() inside shrinker functions when
the shrinker functions do memory allocation, because
drivers/gpu/drm/ttm/ttm_page_alloc_dma.c does memory allocation with the
mutex held inside ttm_dma_pool_shrink_scan().

If I compile the test module shown below, which mimics an extreme case of
what ttm_dma_pool_shrink_scan() will do,

---------- test.c start ----------
#include <linux/module.h>
#include <linux/sched.h>
#include <linux/slab.h>
#include <linux/mm.h>

static DEFINE_MUTEX(lock);

static unsigned long shrink_test_count(struct shrinker *shrinker, struct shrink_control *sc)
{
        if (mutex_lock_killable(&lock)) {
                printk(KERN_WARNING "Process %u (%s) gave up waiting for mutex"
                       "\n", current->pid, current->comm);
                return 0;
        }
        mutex_unlock(&lock);
        return 1;
}

static unsigned long shrink_test_scan(struct shrinker *shrinker, struct shrink_control *sc)
{
        LIST_HEAD(list);
        int i = 0;
        if (mutex_lock_killable(&lock)) {
                printk(KERN_WARNING "Process %u (%s) gave up waiting for mutex"
                       "\n", current->pid, current->comm);
                return 0;
        }
        while (1) {
                struct list_head *l = kmalloc(PAGE_SIZE, sc->gfp_mask);
                if (!l)
                        break;
                list_add_tail(l, &list);
                i++;
        }
        printk(KERN_WARNING "Process %u (%s) allocated %u pages\n",
               current->pid, current->comm, i);
        while (i--) {
                struct list_head *l = list.next;
                list_del(l);
                kfree(l);
        }
        mutex_unlock(&lock);
        return 1;
}

static struct shrinker recursive_shrinker = {
        .count_objects = shrink_test_count,
        .scan_objects = shrink_test_scan,
        .seeks = DEFAULT_SEEKS,
};

static int __init recursive_shrinker_init(void)
{
        register_shrinker(&recursive_shrinker);
        return 0;
}

static void recursive_shrinker_exit(void)
{
        unregister_shrinker(&recursive_shrinker);
}

module_init(recursive_shrinker_init);
module_exit(recursive_shrinker_exit);
MODULE_LICENSE("GPL");
---------- test.c end ----------

and load the test module and do

  # echo 3 > /proc/sys/vm/drop_caches

the system stalls with 0% CPU usage because of a mutex deadlock
(with a prior lockdep warning).

Is this because the wrong gfp flags are passed to kmalloc()? Is this
because the test module's shrinker functions return wrong values? Is this
because doing memory allocation with a mutex held inside shrinker
functions is forbidden? Can anybody tell me what is wrong with my test
module?

Regards.

[   48.077353] 
[   48.077999] =================================
[   48.080023] [ INFO: inconsistent lock state ]
[   48.080023] 3.15.0-rc6-00190-g1ee1cea #203 Tainted: G           OE
[   48.080023] ---------------------------------
[   48.080023] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
[   48.086745] kswapd0/784 [HC0[0]:SC0[0]:HE1:SE1] takes:
[   48.086745]  (lock#2){+.+.?.}, at: [<e0861022>] shrink_test_count+0x12/0x60 [test]
[   48.086745] {RECLAIM_FS-ON-W} state was registered at:
[   48.086745]   [<c1089c18>] mark_held_locks+0x68/0x90
[   48.086745]   [<c1089cda>] lockdep_trace_alloc+0x9a/0xe0
[   48.086745]   [<c110b7f3>] kmem_cache_alloc+0x23/0x170
[   48.086745]   [<e08610aa>] shrink_test_scan+0x3a/0xf90 [test]
[   48.086745]   [<c10e59be>] shrink_slab_node+0x13e/0x1d0
[   48.086745]   [<c10e6911>] shrink_slab+0x61/0xe0
[   48.086745]   [<c115f849>] drop_caches_sysctl_handler+0x69/0xf0
[   48.086745]   [<c117275a>] proc_sys_call_handler+0x6a/0xa0
[   48.086745]   [<c11727aa>] proc_sys_write+0x1a/0x20
[   48.086745]   [<c1110ac0>] vfs_write+0xa0/0x190
[   48.086745]   [<c1110ca6>] SyS_write+0x56/0xc0
[   48.086745]   [<c15201d6>] syscall_call+0x7/0xb
[   48.086745] irq event stamp: 39
[   48.086745] hardirqs last  enabled at (39): [<c10f3480>] count_shadow_nodes+0x20/0x40
[   48.086745] hardirqs last disabled at (38): [<c10f346c>] count_shadow_nodes+0xc/0x40
[   48.086745] softirqs last  enabled at (0): [<c1040627>] copy_process+0x2e7/0x1400
[   48.086745] softirqs last disabled at (0): [<  (null)>]   (null)
[   48.086745] 
[   48.086745] other info that might help us debug this:
[   48.086745]  Possible unsafe locking scenario:
[   48.086745] 
[   48.086745]        CPU0
[   48.086745]        ----
[   48.086745]   lock(lock#2);
[   48.086745]   <Interrupt>
[   48.086745]     lock(lock#2);
[   48.086745] 
[   48.086745]  *** DEADLOCK ***
[   48.086745] 
[   48.086745] 1 lock held by kswapd0/784:
[   48.086745]  #0:  (shrinker_rwsem){++++.+}, at: [<c10e68da>] shrink_slab+0x2a/0xe0
[   48.086745] 
[   48.086745] stack backtrace:
[   48.086745] CPU: 1 PID: 784 Comm: kswapd0 Tainted: G           OE 3.15.0-rc6-00190-g1ee1cea #203
[   48.086745] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 08/15/2008
[   48.086745]  c1ab9c20 dd187c94 c151a48f dd184250 dd187cd0 c1088f33 c165aa02 c165ac9d
[   48.086745]  00000310 00000000 00000000 00000000 00000000 00000001 00000001 c165ac9d
[   48.086745]  dd1847dc 0000000a 00000008 dd187cfc c1089ae1 00000008 000001b8 31a0987d
[   48.086745] Call Trace:
[   48.086745]  [<c151a48f>] dump_stack+0x48/0x61
[   48.086745]  [<c1088f33>] print_usage_bug+0x1f3/0x250
[   48.086745]  [<c1089ae1>] mark_lock+0x331/0x400
[   48.086745]  [<c1088f90>] ? print_usage_bug+0x250/0x250
[   48.086745]  [<c108a583>] __lock_acquire+0x283/0x1640
[   48.086745]  [<c108b9bb>] lock_acquire+0x7b/0xa0
[   48.086745]  [<e0861022>] ? shrink_test_count+0x12/0x60 [test]
[   48.086745]  [<c151c544>] mutex_lock_killable_nested+0x64/0x3e0
[   48.086745]  [<e0861022>] ? shrink_test_count+0x12/0x60 [test]
[   48.086745]  [<e0861022>] ? shrink_test_count+0x12/0x60 [test]
[   48.086745]  [<c11119f1>] ? put_super+0x21/0x30
[   48.086745]  [<e0861022>] shrink_test_count+0x12/0x60 [test]
[   48.086745]  [<c10e58ae>] shrink_slab_node+0x2e/0x1d0
[   48.086745]  [<c10e68da>] ? shrink_slab+0x2a/0xe0
[   48.086745]  [<c10e6911>] shrink_slab+0x61/0xe0
[   48.086745]  [<c10e8416>] kswapd+0x5f6/0x8e0
[   48.086745]  [<c1062e0f>] kthread+0xaf/0xd0
[   48.086745]  [<c10e7e20>] ? try_to_free_pages+0x540/0x540
[   48.086745]  [<c108a08b>] ? trace_hardirqs_on+0xb/0x10
[   48.086745]  [<c1525d41>] ret_from_kernel_thread+0x21/0x30
[   48.086745]  [<c1062d60>] ? __init_kthread_worker+0x60/0x60

[   77.958388] SysRq : Show State
[   77.959377]   task                PC stack   pid father
[   77.960803] bash            D dfa6ae80  5068     1      0 0x00000000
[   77.962348]  ded35c30 00000046 dfa6ae90 dfa6ae80 322a9328 00000000 00000000 0000000b
[   77.962348]  ded32010 c191c8c0 ded34008 32319d5a 0000000b c191c8c0 32319d5a 0000000b
[   77.962348]  ded32010 00000001 ded35c04 3230ea9d 0000000b e0863060 ded325c4 ded32010
[   77.962348] Call Trace:
[   77.962348]  [<c1073ab7>] ? local_clock+0x17/0x30
[   77.962348]  [<c151b41e>] schedule+0x1e/0x60
[   77.962348]  [<c151b6df>] schedule_preempt_disabled+0xf/0x20
[   77.962348]  [<c151c63f>] mutex_lock_killable_nested+0x15f/0x3e0
[   77.962348]  [<e0861022>] ? shrink_test_count+0x12/0x60 [test]
[   77.962348]  [<e0861022>] ? shrink_test_count+0x12/0x60 [test]
[   77.962348]  [<e0861022>] shrink_test_count+0x12/0x60 [test]
[   77.962348]  [<c10e58ae>] shrink_slab_node+0x2e/0x1d0
[   77.962348]  [<c10e68da>] ? shrink_slab+0x2a/0xe0
[   77.962348]  [<c10e6911>] shrink_slab+0x61/0xe0
[   77.962348]  [<c10e7b48>] try_to_free_pages+0x268/0x540
[   77.962348]  [<c10df529>] __alloc_pages_nodemask+0x3e9/0x720
[   77.962348]  [<c110bcbd>] cache_alloc_refill+0x37d/0x720
[   77.962348]  [<e08610aa>] ? shrink_test_scan+0x3a/0xf90 [test]
[   77.962348]  [<c110b902>] kmem_cache_alloc+0x132/0x170
[   77.962348]  [<e08610aa>] ? shrink_test_scan+0x3a/0xf90 [test]
[   77.962348]  [<e08610aa>] shrink_test_scan+0x3a/0xf90 [test]
[   77.962348]  [<c151c4d8>] ? mutex_unlock+0x8/0x10
[   77.962348]  [<c10e59be>] shrink_slab_node+0x13e/0x1d0
[   77.962348]  [<c10e68da>] ? shrink_slab+0x2a/0xe0
[   77.962348]  [<c10e6911>] shrink_slab+0x61/0xe0
[   77.962348]  [<c115f849>] drop_caches_sysctl_handler+0x69/0xf0
[   77.962348]  [<c151fe6d>] ? _raw_spin_unlock+0x1d/0x30
[   77.962348]  [<c117275a>] proc_sys_call_handler+0x6a/0xa0
[   77.962348]  [<c11727aa>] proc_sys_write+0x1a/0x20
[   77.962348]  [<c1110ac0>] vfs_write+0xa0/0x190
[   77.962348]  [<c1172790>] ? proc_sys_call_handler+0xa0/0xa0
[   77.962348]  [<c112d0fd>] ? __fdget+0xd/0x10
[   77.962348]  [<c1110ca6>] SyS_write+0x56/0xc0
[   77.962348]  [<c15201d6>] syscall_call+0x7/0xb

[   77.962348] kswapd0         D 00000246  6200   784      2 0x00000000
[   77.962348]  dd187d9c 00000046 c109d091 00000246 00000086 00000000 00000246 dd184250
[   77.962348]  dd184250 c191c8c0 dd186008 318e2084 0000000b c191c8c0 37e97ef1 0000000b
[   77.962348]  dd184250 dd184250 dd187d70 c10880aa dd187dac 00000000 0000007b ffffffff
[   77.962348] Call Trace:
[   77.962348]  [<c109d091>] ? rcu_irq_exit+0x71/0xc0
[   77.962348]  [<c10880aa>] ? print_lock_contention_bug+0x1a/0xf0
[   77.962348]  [<c151b41e>] schedule+0x1e/0x60
[   77.962348]  [<c151b6df>] schedule_preempt_disabled+0xf/0x20
[   77.962348]  [<c151c63f>] mutex_lock_killable_nested+0x15f/0x3e0
[   77.962348]  [<e0861022>] ? shrink_test_count+0x12/0x60 [test]
[   77.962348]  [<e0861022>] ? shrink_test_count+0x12/0x60 [test]
[   77.962348]  [<e0861022>] shrink_test_count+0x12/0x60 [test]
[   77.962348]  [<c10e58ae>] shrink_slab_node+0x2e/0x1d0
[   77.962348]  [<c10e68da>] ? shrink_slab+0x2a/0xe0
[   77.962348]  [<c10e6911>] shrink_slab+0x61/0xe0
[   77.962348]  [<c10e8416>] kswapd+0x5f6/0x8e0
[   77.962348]  [<c1062e0f>] kthread+0xaf/0xd0
[   77.962348]  [<c10e7e20>] ? try_to_free_pages+0x540/0x540
[   77.962348]  [<c108a08b>] ? trace_hardirqs_on+0xb/0x10
[   77.962348]  [<c1525d41>] ret_from_kernel_thread+0x21/0x30
[   77.962348]  [<c1062d60>] ? __init_kthread_worker+0x60/0x60

Tetsuo Handa wrote:
> Tetsuo Handa wrote:
> > From e314a1a1583e585d062dfc30c8aad8bf5380510b Mon Sep 17 00:00:00 2001
> > From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> > Date: Mon, 19 May 2014 18:43:21 +0900
> > Subject: [PATCH] gpu/drm/ttm: Use mutex_lock_killable() for shrinker functions.
> > 
> > I can observe that a RHEL7 environment stalls with 100% CPU usage when a
> > certain type of memory pressure is applied. While the shrinker functions
> > are being called by shrink_slab() before the OOM killer is triggered, the
> > stall lasts for many minutes.
> > 
> > I added a debug printk() and observed that many threads are blocked for more
> > than 10 seconds in the ttm_dma_pool_shrink_count()/ttm_dma_pool_shrink_scan()
> > functions. Since kswapd can call these functions later, the current thread
> > can return from these functions as soon as it is chosen by the OOM killer.
> > 
> > This patch changes "mutex_lock();" to "if (mutex_lock_killable()) return ...;"
> > so that any thread can promptly give up. (By the way, as far as I tested,
> > changing to "if (!mutex_trylock()) return ...;" likely shortens the duration
> > of the stall. Maybe we don't need to wait for the mutex if someone is already
> > calling these functions.)
> > 
> 
> While discussing an XFS problem, a question came to mind. Is it OK (from
> the point of view of reentrancy) to use mutex_lock() or
> mutex_lock_killable() inside a shrinker's entry point functions? Is the
> scenario shown below possible?
> 
> (1) kswapd is doing memory reclaim which does not need to hold the mutex.
> 
> (2) Someone in GFP_KERNEL context (not kswapd) calls
>     ttm_dma_pool_shrink_count() and then calls ttm_dma_pool_shrink_scan()
>     from the direct reclaim path.
> 
> (3) Inside ttm_dma_pool_shrink_scan(), a GFP_KERNEL allocation is issued
>     while the mutex is held by that thread.
> 
> (4) The GFP_KERNEL allocation cannot be completed immediately due to
>     memory pressure.
> 
> (5) kswapd calls ttm_dma_pool_shrink_count(), which needs to hold the
>     mutex.
> 
> (6) Inside ttm_dma_pool_shrink_count(), kswapd is blocked waiting for the
>     mutex held by that thread, and that thread is waiting for the
>     GFP_KERNEL allocation to complete, but the GFP_KERNEL allocation
>     cannot be completed until the mutex held by that thread is released?
> 


* Re: [PATCH] gpu/drm/ttm: Use mutex_lock_killable() for shrinker functions.
  2014-05-24 14:22     ` Tetsuo Handa
@ 2014-05-28 18:54       ` Konrad Rzeszutek Wilk
  2014-05-28 21:47         ` Tetsuo Handa
  0 siblings, 1 reply; 21+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-05-28 18:54 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: dchinner, airlied, glommer, mgorman, linux-mm, linux-kernel, dri-devel

On Sat, May 24, 2014 at 11:22:09PM +0900, Tetsuo Handa wrote:
> Hello.
> 
> I tried to test whether it is OK (from the point of view of reentrancy) to
> use mutex_lock() or mutex_lock_killable() inside shrinker functions when
> the shrinker functions do memory allocation, because
> drivers/gpu/drm/ttm/ttm_page_alloc_dma.c does memory allocation with the
> mutex held inside ttm_dma_pool_shrink_scan().
> 
> If I compile the test module shown below, which mimics an extreme case of
> what ttm_dma_pool_shrink_scan() will do,

And ttm_pool_shrink_scan.

> 
> ---------- test.c start ----------
> #include <linux/module.h>
> #include <linux/sched.h>
> #include <linux/slab.h>
> #include <linux/mm.h>
> 
> static DEFINE_MUTEX(lock);
> 
> static unsigned long shrink_test_count(struct shrinker *shrinker, struct shrink_control *sc)
> {
>         if (mutex_lock_killable(&lock)) {
>                 printk(KERN_WARNING "Process %u (%s) gave up waiting for mutex"
>                        "\n", current->pid, current->comm);
>                 return 0;
>         }
>         mutex_unlock(&lock);
>         return 1;
> }
> 
> static unsigned long shrink_test_scan(struct shrinker *shrinker, struct shrink_control *sc)
> {
>         LIST_HEAD(list);
>         int i = 0;
>         if (mutex_lock_killable(&lock)) {
>                 printk(KERN_WARNING "Process %u (%s) gave up waiting for mutex"
>                        "\n", current->pid, current->comm);
>                 return 0;
>         }
>         while (1) {
>                 struct list_head *l = kmalloc(PAGE_SIZE, sc->gfp_mask);
>                 if (!l)
>                         break;
>                 list_add_tail(l, &list);
>                 i++;
>         }
>         printk(KERN_WARNING "Process %u (%s) allocated %u pages\n",
>                current->pid, current->comm, i);
>         while (i--) {
>                 struct list_head *l = list.next;
>                 list_del(l);
>                 kfree(l);
>         }
>         mutex_unlock(&lock);
>         return 1;
> }
> 
> static struct shrinker recursive_shrinker = {
>         .count_objects = shrink_test_count,
>         .scan_objects = shrink_test_scan,
>         .seeks = DEFAULT_SEEKS,
> };
> 
> static int __init recursive_shrinker_init(void)
> {
>         register_shrinker(&recursive_shrinker);
>         return 0;
> }
> 
> static void recursive_shrinker_exit(void)
> {
>         unregister_shrinker(&recursive_shrinker);
> }
> 
> module_init(recursive_shrinker_init);
> module_exit(recursive_shrinker_exit);
> MODULE_LICENSE("GPL");
> ---------- test.c end ----------
> 
> and load the test module and do
> 
>   # echo 3 > /proc/sys/vm/drop_caches
> 
> the system stalls with 0% CPU usage because of a mutex deadlock
> (with a prior lockdep warning).
> 
> Is this because the wrong gfp flags are passed to kmalloc()? Is this
> because the test module's shrinker functions return wrong values? Is this
> because doing memory allocation with a mutex held inside shrinker
> functions is forbidden? Can anybody tell me what is wrong with my test
> module?

What is the sc->gfp_flags? What if you use GFP_ATOMIC?

Regarding the lockdep warning below, it looks like
> 
> Regards.
> 
> [   48.077353] 
> [   48.077999] =================================
> [   48.080023] [ INFO: inconsistent lock state ]
> [   48.080023] 3.15.0-rc6-00190-g1ee1cea #203 Tainted: G           OE
> [   48.080023] ---------------------------------
> [   48.080023] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
> [   48.086745] kswapd0/784 [HC0[0]:SC0[0]:HE1:SE1] takes:
> [   48.086745]  (lock#2){+.+.?.}, at: [<e0861022>] shrink_test_count+0x12/0x60 [test]
> [   48.086745] {RECLAIM_FS-ON-W} state was registered at:


You have the scenario you described below, that is:

shrink_test_scan	
	mutex_lock_killable()
		-> kmalloc
			-> shrink_test_count
				mutex_lock_killable()

And 'mutex_lock_killable' is the same (in at least this context)
as 'mutex_lock'. In other words, your second 'mutex_lock'
is going to block - which is a deadlock.

Perhaps a way of not getting into this scenario is:

 1). Try to take the mutex (i.e., use a variant that won't block if it
     can't get it).

 2). Use GFP_ATOMIC in the shrinker so that we never
     end up calling ourselves in case of memory pressure

?
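
For illustration, suggestion 1) applied to the test module's scan
function might look like this (an untested sketch, not a submitted
patch):

---------- sketch.c start ----------
static unsigned long shrink_test_scan(struct shrinker *shrinker,
                                      struct shrink_control *sc)
{
        /*
         * mutex_trylock() returns 0 immediately when the mutex is
         * contended, so a recursive call from reclaim backs off
         * instead of deadlocking on the lock it already holds.
         */
        if (!mutex_trylock(&lock))
                return 0;
        /* ... allocate and free pages as in the original module ... */
        mutex_unlock(&lock);
        return 1;
}
---------- sketch.c end ----------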

> [   48.086745]   [<c1089c18>] mark_held_locks+0x68/0x90
> [   48.086745]   [<c1089cda>] lockdep_trace_alloc+0x9a/0xe0
> [   48.086745]   [<c110b7f3>] kmem_cache_alloc+0x23/0x170
> [   48.086745]   [<e08610aa>] shrink_test_scan+0x3a/0xf90 [test]
> [   48.086745]   [<c10e59be>] shrink_slab_node+0x13e/0x1d0
> [   48.086745]   [<c10e6911>] shrink_slab+0x61/0xe0
> [   48.086745]   [<c115f849>] drop_caches_sysctl_handler+0x69/0xf0
> [   48.086745]   [<c117275a>] proc_sys_call_handler+0x6a/0xa0
> [   48.086745]   [<c11727aa>] proc_sys_write+0x1a/0x20
> [   48.086745]   [<c1110ac0>] vfs_write+0xa0/0x190
> [   48.086745]   [<c1110ca6>] SyS_write+0x56/0xc0
> [   48.086745]   [<c15201d6>] syscall_call+0x7/0xb
> [   48.086745] irq event stamp: 39
> [   48.086745] hardirqs last  enabled at (39): [<c10f3480>] count_shadow_nodes+0x20/0x40
> [   48.086745] hardirqs last disabled at (38): [<c10f346c>] count_shadow_nodes+0xc/0x40
> [   48.086745] softirqs last  enabled at (0): [<c1040627>] copy_process+0x2e7/0x1400
> [   48.086745] softirqs last disabled at (0): [<  (null)>]   (null)
> [   48.086745] 
> [   48.086745] other info that might help us debug this:
> [   48.086745]  Possible unsafe locking scenario:
> [   48.086745] 
> [   48.086745]        CPU0
> [   48.086745]        ----
> [   48.086745]   lock(lock#2);
> [   48.086745]   <Interrupt>
> [   48.086745]     lock(lock#2);
> [   48.086745] 
> [   48.086745]  *** DEADLOCK ***
> [   48.086745] 
> [   48.086745] 1 lock held by kswapd0/784:
> [   48.086745]  #0:  (shrinker_rwsem){++++.+}, at: [<c10e68da>] shrink_slab+0x2a/0xe0
> [   48.086745] 
> [   48.086745] stack backtrace:
> [   48.086745] CPU: 1 PID: 784 Comm: kswapd0 Tainted: G           OE 3.15.0-rc6-00190-g1ee1cea #203
> [   48.086745] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 08/15/2008
> [   48.086745]  c1ab9c20 dd187c94 c151a48f dd184250 dd187cd0 c1088f33 c165aa02 c165ac9d
> [   48.086745]  00000310 00000000 00000000 00000000 00000000 00000001 00000001 c165ac9d
> [   48.086745]  dd1847dc 0000000a 00000008 dd187cfc c1089ae1 00000008 000001b8 31a0987d
> [   48.086745] Call Trace:
> [   48.086745]  [<c151a48f>] dump_stack+0x48/0x61
> [   48.086745]  [<c1088f33>] print_usage_bug+0x1f3/0x250
> [   48.086745]  [<c1089ae1>] mark_lock+0x331/0x400
> [   48.086745]  [<c1088f90>] ? print_usage_bug+0x250/0x250
> [   48.086745]  [<c108a583>] __lock_acquire+0x283/0x1640
> [   48.086745]  [<c108b9bb>] lock_acquire+0x7b/0xa0
> [   48.086745]  [<e0861022>] ? shrink_test_count+0x12/0x60 [test]
> [   48.086745]  [<c151c544>] mutex_lock_killable_nested+0x64/0x3e0
> [   48.086745]  [<e0861022>] ? shrink_test_count+0x12/0x60 [test]
> [   48.086745]  [<e0861022>] ? shrink_test_count+0x12/0x60 [test]
> [   48.086745]  [<c11119f1>] ? put_super+0x21/0x30
> [   48.086745]  [<e0861022>] shrink_test_count+0x12/0x60 [test]
> [   48.086745]  [<c10e58ae>] shrink_slab_node+0x2e/0x1d0
> [   48.086745]  [<c10e68da>] ? shrink_slab+0x2a/0xe0
> [   48.086745]  [<c10e6911>] shrink_slab+0x61/0xe0
> [   48.086745]  [<c10e8416>] kswapd+0x5f6/0x8e0
> [   48.086745]  [<c1062e0f>] kthread+0xaf/0xd0
> [   48.086745]  [<c10e7e20>] ? try_to_free_pages+0x540/0x540
> [   48.086745]  [<c108a08b>] ? trace_hardirqs_on+0xb/0x10
> [   48.086745]  [<c1525d41>] ret_from_kernel_thread+0x21/0x30
> [   48.086745]  [<c1062d60>] ? __init_kthread_worker+0x60/0x60
> 
> [   77.958388] SysRq : Show State
> [   77.959377]   task                PC stack   pid father
> [   77.960803] bash            D dfa6ae80  5068     1      0 0x00000000
> [   77.962348]  ded35c30 00000046 dfa6ae90 dfa6ae80 322a9328 00000000 00000000 0000000b
> [   77.962348]  ded32010 c191c8c0 ded34008 32319d5a 0000000b c191c8c0 32319d5a 0000000b
> [   77.962348]  ded32010 00000001 ded35c04 3230ea9d 0000000b e0863060 ded325c4 ded32010
> [   77.962348] Call Trace:
> [   77.962348]  [<c1073ab7>] ? local_clock+0x17/0x30
> [   77.962348]  [<c151b41e>] schedule+0x1e/0x60
> [   77.962348]  [<c151b6df>] schedule_preempt_disabled+0xf/0x20
> [   77.962348]  [<c151c63f>] mutex_lock_killable_nested+0x15f/0x3e0
> [   77.962348]  [<e0861022>] ? shrink_test_count+0x12/0x60 [test]
> [   77.962348]  [<e0861022>] ? shrink_test_count+0x12/0x60 [test]
> [   77.962348]  [<e0861022>] shrink_test_count+0x12/0x60 [test]
> [   77.962348]  [<c10e58ae>] shrink_slab_node+0x2e/0x1d0
> [   77.962348]  [<c10e68da>] ? shrink_slab+0x2a/0xe0
> [   77.962348]  [<c10e6911>] shrink_slab+0x61/0xe0
> [   77.962348]  [<c10e7b48>] try_to_free_pages+0x268/0x540
> [   77.962348]  [<c10df529>] __alloc_pages_nodemask+0x3e9/0x720
> [   77.962348]  [<c110bcbd>] cache_alloc_refill+0x37d/0x720
> [   77.962348]  [<e08610aa>] ? shrink_test_scan+0x3a/0xf90 [test]
> [   77.962348]  [<c110b902>] kmem_cache_alloc+0x132/0x170
> [   77.962348]  [<e08610aa>] ? shrink_test_scan+0x3a/0xf90 [test]
> [   77.962348]  [<e08610aa>] shrink_test_scan+0x3a/0xf90 [test]
> [   77.962348]  [<c151c4d8>] ? mutex_unlock+0x8/0x10
> [   77.962348]  [<c10e59be>] shrink_slab_node+0x13e/0x1d0
> [   77.962348]  [<c10e68da>] ? shrink_slab+0x2a/0xe0
> [   77.962348]  [<c10e6911>] shrink_slab+0x61/0xe0
> [   77.962348]  [<c115f849>] drop_caches_sysctl_handler+0x69/0xf0
> [   77.962348]  [<c151fe6d>] ? _raw_spin_unlock+0x1d/0x30
> [   77.962348]  [<c117275a>] proc_sys_call_handler+0x6a/0xa0
> [   77.962348]  [<c11727aa>] proc_sys_write+0x1a/0x20
> [   77.962348]  [<c1110ac0>] vfs_write+0xa0/0x190
> [   77.962348]  [<c1172790>] ? proc_sys_call_handler+0xa0/0xa0
> [   77.962348]  [<c112d0fd>] ? __fdget+0xd/0x10
> [   77.962348]  [<c1110ca6>] SyS_write+0x56/0xc0
> [   77.962348]  [<c15201d6>] syscall_call+0x7/0xb
> 
> [   77.962348] kswapd0         D 00000246  6200   784      2 0x00000000
> [   77.962348]  dd187d9c 00000046 c109d091 00000246 00000086 00000000 00000246 dd184250
> [   77.962348]  dd184250 c191c8c0 dd186008 318e2084 0000000b c191c8c0 37e97ef1 0000000b
> [   77.962348]  dd184250 dd184250 dd187d70 c10880aa dd187dac 00000000 0000007b ffffffff
> [   77.962348] Call Trace:
> [   77.962348]  [<c109d091>] ? rcu_irq_exit+0x71/0xc0
> [   77.962348]  [<c10880aa>] ? print_lock_contention_bug+0x1a/0xf0
> [   77.962348]  [<c151b41e>] schedule+0x1e/0x60
> [   77.962348]  [<c151b6df>] schedule_preempt_disabled+0xf/0x20
> [   77.962348]  [<c151c63f>] mutex_lock_killable_nested+0x15f/0x3e0
> [   77.962348]  [<e0861022>] ? shrink_test_count+0x12/0x60 [test]
> [   77.962348]  [<e0861022>] ? shrink_test_count+0x12/0x60 [test]
> [   77.962348]  [<e0861022>] shrink_test_count+0x12/0x60 [test]
> [   77.962348]  [<c10e58ae>] shrink_slab_node+0x2e/0x1d0
> [   77.962348]  [<c10e68da>] ? shrink_slab+0x2a/0xe0
> [   77.962348]  [<c10e6911>] shrink_slab+0x61/0xe0
> [   77.962348]  [<c10e8416>] kswapd+0x5f6/0x8e0
> [   77.962348]  [<c1062e0f>] kthread+0xaf/0xd0
> [   77.962348]  [<c10e7e20>] ? try_to_free_pages+0x540/0x540
> [   77.962348]  [<c108a08b>] ? trace_hardirqs_on+0xb/0x10
> [   77.962348]  [<c1525d41>] ret_from_kernel_thread+0x21/0x30
> [   77.962348]  [<c1062d60>] ? __init_kthread_worker+0x60/0x60
> 
> Tetsuo Handa wrote:
> > Tetsuo Handa wrote:
> > > From e314a1a1583e585d062dfc30c8aad8bf5380510b Mon Sep 17 00:00:00 2001
> > > From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> > > Date: Mon, 19 May 2014 18:43:21 +0900
> > > Subject: [PATCH] gpu/drm/ttm: Use mutex_lock_killable() for shrinker functions.
> > > 
> > > I can observe that a RHEL7 environment stalls with 100% CPU usage when a
> > > certain type of memory pressure is applied. While the shrinker functions
> > > are being called by shrink_slab() before the OOM killer is triggered, the
> > > stall lasts for many minutes.
> > > 
> > > I added a debug printk() and observed that many threads are blocked for more
> > > than 10 seconds in the ttm_dma_pool_shrink_count()/ttm_dma_pool_shrink_scan()
> > > functions. Since kswapd can call these functions later, the current thread
> > > can return from these functions as soon as it is chosen by the OOM killer.
> > > 
> > > This patch changes "mutex_lock();" to "if (mutex_lock_killable()) return ...;"
> > > so that any thread can promptly give up. (By the way, as far as I tested,
> > > changing to "if (!mutex_trylock()) return ...;" likely shortens the duration
> > > of the stall. Maybe we don't need to wait for the mutex if someone is already
> > > calling these functions.)
> > > 
> > 
> > While discussing an XFS problem, a question came to mind. Is it OK (from
> > the point of view of reentrancy) to use mutex_lock() or
> > mutex_lock_killable() inside a shrinker's entry point functions? Is the
> > scenario shown below possible?
> > 
> > (1) kswapd is doing memory reclaim which does not need to hold the mutex.
> > 
> > (2) Someone in GFP_KERNEL context (not kswapd) calls
> >     ttm_dma_pool_shrink_count() and then calls ttm_dma_pool_shrink_scan()
> >     from the direct reclaim path.
> > 
> > (3) Inside ttm_dma_pool_shrink_scan(), a GFP_KERNEL allocation is issued
> >     while the mutex is held by that thread.
> > 
> > (4) The GFP_KERNEL allocation cannot be completed immediately due to
> >     memory pressure.
> > 
> > (5) kswapd calls ttm_dma_pool_shrink_count(), which needs to hold the
> >     mutex.
> > 
> > (6) Inside ttm_dma_pool_shrink_count(), kswapd is blocked waiting for the
> >     mutex held by that thread, and that thread is waiting for the
> >     GFP_KERNEL allocation to complete, but the GFP_KERNEL allocation
> >     cannot be completed until the mutex held by that thread is released?

Ewww. Perhaps if we used GFP_ATOMIC for the array allocation we do in
ttm_dma_page_pool_free and ttm_page_pool_free?

That would avoid the 4) problem.
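
A hypothetical sketch of that idea (the real ttm_page_pool_free() has a
different signature; only the GFP_ATOMIC allocation is the point here,
and sketch_pool_free is a made-up name):

---------- sketch.c start ----------
static unsigned sketch_pool_free(unsigned nr_free)
{
        struct page **pages_to_free;

        /*
         * GFP_ATOMIC never enters direct reclaim, so this allocation
         * cannot recurse into shrink_slab() (avoiding step (4) above);
         * the trade-off is that it may fail under memory pressure.
         */
        pages_to_free = kmalloc(nr_free * sizeof(struct page *),
                                GFP_ATOMIC);
        if (!pages_to_free)
                return 0;       /* give up this shrink pass */
        /* ... collect up to nr_free pages from the pool, free them ... */
        kfree(pages_to_free);
        return nr_free;
}
---------- sketch.c end ----------
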
> > 


* Re: [PATCH] gpu/drm/ttm: Use mutex_lock_killable() for shrinker functions.
  2014-05-28 18:54       ` Konrad Rzeszutek Wilk
@ 2014-05-28 21:47         ` Tetsuo Handa
  2014-05-29 14:34           ` Tetsuo Handa
  2014-05-30 16:06           ` [PATCH] gpu/drm/ttm: Use mutex_lock_killable() for shrinker functions Konrad Rzeszutek Wilk
  0 siblings, 2 replies; 21+ messages in thread
From: Tetsuo Handa @ 2014-05-28 21:47 UTC (permalink / raw)
  To: konrad.wilk
  Cc: dchinner, airlied, glommer, mgorman, linux-mm, linux-kernel, dri-devel

Konrad Rzeszutek Wilk wrote:
> On Sat, May 24, 2014 at 11:22:09PM +0900, Tetsuo Handa wrote:
> > Hello.
> > 
> > I tried to test whether it is OK (from the point of view of reentrancy) to
> > use mutex_lock() or mutex_lock_killable() inside shrinker functions when
> > the shrinker functions do memory allocation, because
> > drivers/gpu/drm/ttm/ttm_page_alloc_dma.c does memory allocation with the
> > mutex held inside ttm_dma_pool_shrink_scan().
> > 
> > If I compile the test module shown below, which mimics an extreme case of
> > what ttm_dma_pool_shrink_scan() will do,
> 
> And ttm_pool_shrink_scan.

I don't know why, but ttm_pool_shrink_scan() does not take the mutex.

> > and load the test module and do
> > 
> >   # echo 3 > /proc/sys/vm/drop_caches
> > 
> > the system stalls with 0% CPU usage because of a mutex deadlock
> > (with a prior lockdep warning).
> > 
> > Is this because the wrong gfp flags are passed to kmalloc()? Is this
> > because the test module's shrinker functions return wrong values? Is this
> > because doing memory allocation with a mutex held inside shrinker
> > functions is forbidden? Can anybody tell me what is wrong with my test
> > module?
> 
> What is the sc->gfp_flags? What if you use GFP_ATOMIC?
> 
I didn't check, but at least I'm sure that the __GFP_WAIT bit is set.
Thus, GFP_ATOMIC or GFP_NOWAIT would solve this problem.
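
For example, in the test module above the recursion disappears if the
blocking bit is masked out of the caller-supplied flags (a sketch;
__GFP_WAIT is the blocking bit in this kernel version):

---------- sketch.c start ----------
        /* With __GFP_WAIT cleared, the allocation fails fast instead of
         * entering direct reclaim and re-running the shrinkers. */
        struct list_head *l = kmalloc(PAGE_SIZE,
                                      sc->gfp_mask & ~__GFP_WAIT);
---------- sketch.c end ----------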

> Regarding the lockdep warning below, it looks like
> > 
> > Regards.
> > 
> > [   48.077353] 
> > [   48.077999] =================================
> > [   48.080023] [ INFO: inconsistent lock state ]
> > [   48.080023] 3.15.0-rc6-00190-g1ee1cea #203 Tainted: G           OE
> > [   48.080023] ---------------------------------
> > [   48.080023] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
> > [   48.086745] kswapd0/784 [HC0[0]:SC0[0]:HE1:SE1] takes:
> > [   48.086745]  (lock#2){+.+.?.}, at: [<e0861022>] shrink_test_count+0x12/0x60 [test]
> > [   48.086745] {RECLAIM_FS-ON-W} state was registered at:
> 
> 
> You have the scenario you described below, that is:
> 
> shrink_test_scan	
> 	mutex_lock_killable()
> 		-> kmalloc
> 			-> shrink_test_count
> 				mutex_lock_killable()
> 
> And 'mutex_lock_killable' is the same (in at least this context)
> as 'mutex_lock'. In other words, your second 'mutex_lock'
> is going to block - which is a deadlock.
> 
> Perhaps a way of not getting into this scenario is:
> 
>  1). Try to take the mutex (i.e., use a variant that won't block if it
>      can't get it).
> 
>  2). Use GFP_ATOMIC in the shrinker so that we never
>      end up calling ourselves in case of memory pressure
> 
> ?

Yes, I think so as well.

> > > > This patch changes "mutex_lock();" to "if (mutex_lock_killable()) return ...;"
> > > > so that any thread can promptly give up. (By the way, as far as I tested,
> > > > changing to "if (!mutex_trylock()) return ...;" likely shortens the duration
> > > > of the stall. Maybe we don't need to wait for the mutex if someone is already
> > > > calling these functions.)
> > > > 
> > > 
> > > While discussing an XFS problem, a question came to mind. Is it OK (from
> > > the point of view of reentrancy) to use mutex_lock() or
> > > mutex_lock_killable() inside a shrinker's entry point functions? Is the
> > > scenario shown below possible?
> > > 
> > > (1) kswapd is doing memory reclaim which does not need to hold the mutex.
> > > 
> > > (2) Someone in GFP_KERNEL context (not kswapd) calls
> > >     ttm_dma_pool_shrink_count() and then calls ttm_dma_pool_shrink_scan()
> > >     from the direct reclaim path.
> > > 
> > > (3) Inside ttm_dma_pool_shrink_scan(), a GFP_KERNEL allocation is issued
> > >     while the mutex is held by that thread.
> > > 
> > > (4) The GFP_KERNEL allocation cannot be completed immediately due to
> > >     memory pressure.
> > > 
> > > (5) kswapd calls ttm_dma_pool_shrink_count(), which needs to hold the
> > >     mutex.
> > > 
> > > (6) Inside ttm_dma_pool_shrink_count(), kswapd is blocked waiting for the
> > >     mutex held by that thread, and that thread is waiting for the
> > >     GFP_KERNEL allocation to complete, but the GFP_KERNEL allocation
> > >     cannot be completed until the mutex held by that thread is released?
> 
> Ewww. Perhaps if we used GFP_ATOMIC for the array allocation we do in
> ttm_dma_page_pool_free and ttm_page_pool_free?
> 
> That would avoid the 4) problem.

Right. Which approach ("use GFP_ATOMIC or GFP_NOWAIT" / "use !mutex_trylock()")
do you prefer? I'll create an RHBZ entry for the RHEL7 kernel, as the
non-count/scan version has the same problem.

---------- test.c start ----------
#include <linux/module.h>
#include <linux/sched.h>
#include <linux/slab.h>
#include <linux/mm.h>

static int shrink_test(struct shrinker *shrinker, struct shrink_control *sc)
{
	static DEFINE_MUTEX(lock);
	LIST_HEAD(list);
	int i = 0;
	if (mutex_lock_killable(&lock)) {
		printk(KERN_WARNING "Process %u (%s) gave up waiting for mutex"
		       "\n", current->pid, current->comm);
		return 0;
	}
	while (1) {
		struct list_head *l = kmalloc(PAGE_SIZE, sc->gfp_mask);
		if (!l)
			break;
		list_add_tail(l, &list);
		i++;
	}
	printk(KERN_WARNING "Process %u (%s) allocated %u pages\n",
	       current->pid, current->comm, i);
	while (i--) {
		struct list_head *l = list.next;
		list_del(l);
		kfree(l);
	}
	mutex_unlock(&lock);
	return 0;
}

static struct shrinker recursive_shrinker = {
	.shrink = shrink_test,
	.seeks = DEFAULT_SEEKS,
};

static int __init recursive_shrinker_init(void)
{
	register_shrinker(&recursive_shrinker);
	return 0;
}

module_init(recursive_shrinker_init);
MODULE_LICENSE("GPL");
---------- test.c end ----------

[ 1263.179725] 
[ 1263.180756] =================================
[ 1263.182322] [ INFO: inconsistent lock state ]
[ 1263.183920] 3.10.0-121.el7.x86_64.debug #1 Tainted: GF          O--------------  
[ 1263.186162] ---------------------------------
[ 1263.187742] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
[ 1263.189788] kswapd0/105 [HC0[0]:SC0[0]:HE1:SE1] takes:
[ 1263.191523]  (lock#3){+.+.?.}, at: [<ffffffffa0563040>] shrink_test+0x40/0x140 [test]
[ 1263.194053] {RECLAIM_FS-ON-W} state was registered at:
[ 1263.195848]   [<ffffffff810ea759>] mark_held_locks+0xb9/0x140
[ 1263.197758]   [<ffffffff810ecb6a>] lockdep_trace_alloc+0x7a/0xe0
[ 1263.199718]   [<ffffffff811db9d3>] kmem_cache_alloc_trace+0x33/0x340
[ 1263.201809]   [<ffffffffa0563061>] shrink_test+0x61/0x140 [test]
[ 1263.203662]   [<ffffffff81194a99>] shrink_slab+0xb9/0x4d0
[ 1263.205378]   [<ffffffff81265403>] drop_caches_sysctl_handler+0xc3/0x120
[ 1263.207352]   [<ffffffff8127dab4>] proc_sys_call_handler+0xe4/0x110
[ 1263.209238]   [<ffffffff8127daf4>] proc_sys_write+0x14/0x20
[ 1263.210972]   [<ffffffff811fd1a0>] vfs_write+0xc0/0x1f0
[ 1263.212658]   [<ffffffff811fdc1b>] SyS_write+0x5b/0xb0
[ 1263.214301]   [<ffffffff816bd899>] system_call_fastpath+0x16/0x1b
[ 1263.216172] irq event stamp: 37
[ 1263.217406] hardirqs last  enabled at (37): [<ffffffff816b2f9c>] _raw_spin_unlock_irq+0x2c/0x50
[ 1263.219753] hardirqs last disabled at (36): [<ffffffff816b2dff>] _raw_spin_lock_irq+0x1f/0x90
[ 1263.222052] softirqs last  enabled at (0): [<ffffffff8106aa25>] copy_process.part.22+0x665/0x1750
[ 1263.224414] softirqs last disabled at (0): [<          (null)>]           (null)
[ 1263.226492] 
[ 1263.226492] other info that might help us debug this:
[ 1263.228920]  Possible unsafe locking scenario:
[ 1263.228920] 
[ 1263.231192]        CPU0
[ 1263.232223]        ----
[ 1263.233280]   lock(lock#3);
[ 1263.234435]   <Interrupt>
[ 1263.235489]     lock(lock#3);
[ 1263.236708] 
[ 1263.236708]  *** DEADLOCK ***
[ 1263.236708] 
[ 1263.239358] 1 lock held by kswapd0/105:
[ 1263.240593]  #0:  (shrinker_rwsem){++++.+}, at: [<ffffffff81194a1c>] shrink_slab+0x3c/0x4d0
[ 1263.242894] 
[ 1263.242894] stack backtrace:
[ 1263.244792] CPU: 1 PID: 105 Comm: kswapd0 Tainted: GF          O--------------   3.10.0-121.el7.x86_64.debug #1
[ 1263.247230] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
[ 1263.249747]  ffff880036708000 000000004c6ef89a ffff8800367039c8 ffffffff816a981c
[ 1263.251849]  ffff880036703a18 ffffffff816a3ac5 0000000000000000 ffff880000000001
[ 1263.253956]  ffffffff00000001 000000000000000a ffff880036708000 ffffffff810e88a0
[ 1263.256314] Call Trace:
[ 1263.257365]  [<ffffffff816a981c>] dump_stack+0x19/0x1b
[ 1263.258921]  [<ffffffff816a3ac5>] print_usage_bug+0x1f7/0x208
[ 1263.260591]  [<ffffffff810e88a0>] ? check_usage_backwards+0x1b0/0x1b0
[ 1263.262379]  [<ffffffff810ea61d>] mark_lock+0x21d/0x2a0
[ 1263.263898]  [<ffffffff810eb30a>] __lock_acquire+0x52a/0xb60
[ 1263.265562]  [<ffffffff810232c9>] ? sched_clock+0x9/0x10
[ 1263.267148]  [<ffffffff810b7c75>] ? sched_clock_cpu+0xb5/0x100
[ 1263.268802]  [<ffffffff810ec132>] lock_acquire+0xa2/0x1f0
[ 1263.270378]  [<ffffffffa0563040>] ? shrink_test+0x40/0x140 [test]
[ 1263.272072]  [<ffffffff816ae859>] mutex_lock_killable_nested+0x99/0x5d0
[ 1263.273900]  [<ffffffffa0563040>] ? shrink_test+0x40/0x140 [test]
[ 1263.275610]  [<ffffffffa0563040>] ? shrink_test+0x40/0x140 [test]
[ 1263.277305]  [<ffffffffa0563040>] shrink_test+0x40/0x140 [test]
[ 1263.278970]  [<ffffffff81194a99>] shrink_slab+0xb9/0x4d0
[ 1263.280501]  [<ffffffff811991b9>] balance_pgdat+0x4e9/0x620
[ 1263.282135]  [<ffffffff811994a3>] kswapd+0x1b3/0x640
[ 1263.283604]  [<ffffffff8109f3c0>] ? wake_up_bit+0x30/0x30
[ 1263.285166]  [<ffffffff811992f0>] ? balance_pgdat+0x620/0x620
[ 1263.286798]  [<ffffffff8109e0cd>] kthread+0xed/0x100
[ 1263.288286]  [<ffffffff8109dfe0>] ? insert_kthread_work+0x80/0x80
[ 1263.289973]  [<ffffffff816bd7ec>] ret_from_fork+0x7c/0xb0
[ 1263.291535]  [<ffffffff8109dfe0>] ? insert_kthread_work+0x80/0x80


* Re: [PATCH] gpu/drm/ttm: Use mutex_lock_killable() for shrinker functions.
  2014-05-28 21:47         ` Tetsuo Handa
@ 2014-05-29 14:34           ` Tetsuo Handa
  2014-05-30 16:08             ` Konrad Rzeszutek Wilk
  2014-05-30 16:06           ` [PATCH] gpu/drm/ttm: Use mutex_lock_killable() for shrinker functions Konrad Rzeszutek Wilk
  1 sibling, 1 reply; 21+ messages in thread
From: Tetsuo Handa @ 2014-05-29 14:34 UTC (permalink / raw)
  To: konrad.wilk
  Cc: dchinner, airlied, glommer, mgorman, linux-mm, linux-kernel, dri-devel

Tetsuo Handa wrote:
> Konrad Rzeszutek Wilk wrote:
> > On Sat, May 24, 2014 at 11:22:09PM +0900, Tetsuo Handa wrote:
> > > Hello.
> > > 
> > > I tried to test whether it is OK (from the point of view of reentrancy) to
> > > use mutex_lock() or mutex_lock_killable() inside shrinker functions when
> > > the shrinker functions do memory allocation, because
> > > drivers/gpu/drm/ttm/ttm_page_alloc_dma.c does memory allocation with the
> > > mutex held inside ttm_dma_pool_shrink_scan().
> > > 
> > > If I compile the test module shown below, which mimics an extreme case of
> > > what ttm_dma_pool_shrink_scan() will do,
> > 
> > And ttm_pool_shrink_scan.
> 
> I don't know why, but ttm_pool_shrink_scan() does not take the mutex.
> 
Well, it seems to me that ttm_pool_shrink_scan() not taking the mutex is a
bug which could lead to a stack overflow if kmalloc() in
ttm_page_pool_free() triggered recursion.

  shrink_slab()
  => ttm_pool_shrink_scan()
     => ttm_page_pool_free()
        => kmalloc(GFP_KERNEL)
           => shrink_slab()
              => ttm_pool_shrink_scan()
                 => ttm_page_pool_free()
                    => kmalloc(GFP_KERNEL)

Maybe shrink_slab() should be updated not to call the same shrinker in parallel?
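
One hypothetical way a shrinker could defend itself against such
re-entry, sketched with an atomic flag (this is not existing
shrink_slab() behaviour; guarded_shrink_scan and do_real_scan are
made-up names):

---------- sketch.c start ----------
static atomic_t test_shrinker_active = ATOMIC_INIT(0);

static unsigned long guarded_shrink_scan(struct shrinker *shrinker,
                                         struct shrink_control *sc)
{
        unsigned long freed;

        /* Back off if this shrinker is already running, possibly in
         * our own call chain via kmalloc() -> shrink_slab(). */
        if (atomic_xchg(&test_shrinker_active, 1))
                return SHRINK_STOP;
        freed = do_real_scan(sc);       /* hypothetical helper */
        atomic_set(&test_shrinker_active, 0);
        return freed;
}
---------- sketch.c end ----------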

Also, it seems to me that ttm_dma_pool_shrink_scan() has a potential
division-by-0 bug, as described below. Is this patch correct?
----------
From 4a65744a300e14e5e202c5f13ba2759e1e797d29 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Thu, 29 May 2014 18:25:42 +0900
Subject: [PATCH] gpu/drm/ttm: Use mutex_trylock() for shrinker functions.

I can observe that a RHEL7 environment stalls with 100% CPU usage when a
certain type of memory pressure is applied. While the shrinker functions
are being called by shrink_slab() before the OOM killer is triggered, the
stall lasts for many minutes.

One of the reasons for this stall is that
ttm_dma_pool_shrink_count()/ttm_dma_pool_shrink_scan() are called and
block at mutex_lock(&_manager->lock). A GFP_KERNEL allocation with
_manager->lock held causes someone (including kswapd) to deadlock when
these functions are called due to memory pressure. This patch changes
"mutex_lock();" to "if (!mutex_trylock()) return ...;" in order to
avoid the deadlock.

At the same time, this patch fixes a potential division by 0 due to
unconditionally doing "% _manager->npools". This is because
list_empty(&_manager->pools) being false does not guarantee that
_manager->npools != 0 after taking _manager->lock, because
_manager->npools is updated under _manager->lock.

This patch also moves the update of the start_pool variable in order
to avoid skipping a pool when choosing which pool to shrink in
round-robin style. The start_pool variable is changed from "atomic_t"
to "unsigned int" because it is now updated under _manager->lock.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: stable <stable@kernel.org> [3.3+]
---
 drivers/gpu/drm/ttm/ttm_page_alloc_dma.c | 15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c b/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c
index fb8259f..5e332b4 100644
--- a/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c
+++ b/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c
@@ -1004,9 +1004,9 @@ EXPORT_SYMBOL_GPL(ttm_dma_unpopulate);
 static unsigned long
 ttm_dma_pool_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
 {
-	static atomic_t start_pool = ATOMIC_INIT(0);
+	static unsigned int start_pool;
 	unsigned idx = 0;
-	unsigned pool_offset = atomic_add_return(1, &start_pool);
+	unsigned pool_offset;
 	unsigned shrink_pages = sc->nr_to_scan;
 	struct device_pools *p;
 	unsigned long freed = 0;
@@ -1014,8 +1014,11 @@ ttm_dma_pool_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
 	if (list_empty(&_manager->pools))
 		return SHRINK_STOP;
 
-	mutex_lock(&_manager->lock);
-	pool_offset = pool_offset % _manager->npools;
+	if (!mutex_trylock(&_manager->lock))
+		return SHRINK_STOP;
+	if (!_manager->npools)
+		goto out;
+	pool_offset = ++start_pool % _manager->npools;
 	list_for_each_entry(p, &_manager->pools, pools) {
 		unsigned nr_free;
 
@@ -1034,6 +1037,7 @@ ttm_dma_pool_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
 			 p->pool->dev_name, p->pool->name, current->pid,
 			 nr_free, shrink_pages);
 	}
+out:
 	mutex_unlock(&_manager->lock);
 	return freed;
 }
@@ -1044,7 +1048,8 @@ ttm_dma_pool_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
 	struct device_pools *p;
 	unsigned long count = 0;
 
-	mutex_lock(&_manager->lock);
+	if (!mutex_trylock(&_manager->lock))
+		return 0;
 	list_for_each_entry(p, &_manager->pools, pools)
 		count += p->pool->npages_free;
 	mutex_unlock(&_manager->lock);
-- 
1.8.3.1


* Re: [PATCH] gpu/drm/ttm: Use mutex_lock_killable() for shrinker functions.
  2014-05-28 21:47         ` Tetsuo Handa
  2014-05-29 14:34           ` Tetsuo Handa
@ 2014-05-30 16:06           ` Konrad Rzeszutek Wilk
  1 sibling, 0 replies; 21+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-05-30 16:06 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: dchinner, airlied, glommer, mgorman, linux-mm, linux-kernel, dri-devel

On Thu, May 29, 2014 at 06:47:49AM +0900, Tetsuo Handa wrote:
> Konrad Rzeszutek Wilk wrote:
> > On Sat, May 24, 2014 at 11:22:09PM +0900, Tetsuo Handa wrote:
> > > Hello.
> > > 
> > > I tried to test whether it is OK (from the point of view of reentrancy) to
> > > use mutex_lock() or mutex_lock_killable() inside shrinker functions when
> > > the shrinker functions do memory allocation, because
> > > drivers/gpu/drm/ttm/ttm_page_alloc_dma.c does memory allocation with the
> > > mutex held inside ttm_dma_pool_shrink_scan().
> > > 
> > > If I compile the test module shown below, which mimics an extreme case of
> > > what ttm_dma_pool_shrink_scan() will do,
> > 
> > And ttm_pool_shrink_scan.
> 
> I don't know why, but ttm_pool_shrink_scan() does not take the mutex.
> 
> > > and load the test module and do
> > > 
> > >   # echo 3 > /proc/sys/vm/drop_caches
> > > 
> > > the system stalls with 0% CPU usage because of a mutex deadlock
> > > (with a prior lockdep warning).
> > > 
> > > Is this because the wrong gfp flags are passed to kmalloc()? Is this
> > > because the test module's shrinker functions return wrong values? Is this
> > > because doing memory allocation with a mutex held inside shrinker
> > > functions is forbidden? Can anybody tell me what is wrong with my test
> > > module?
> > 
> > What is the sc->gfp_flags? What if you use GFP_ATOMIC?
> > 
> I didn't check, but at least I'm sure that the __GFP_WAIT bit is set.
> Thus, GFP_ATOMIC or GFP_NOWAIT would solve this problem.
> 
> Regarding the lockdep warning below, it looks like
> > > 
> > > Regards.
> > > 
> > > [   48.077353] 
> > > [   48.077999] =================================
> > > [   48.080023] [ INFO: inconsistent lock state ]
> > > [   48.080023] 3.15.0-rc6-00190-g1ee1cea #203 Tainted: G           OE
> > > [   48.080023] ---------------------------------
> > > [   48.080023] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
> > > [   48.086745] kswapd0/784 [HC0[0]:SC0[0]:HE1:SE1] takes:
> > > [   48.086745]  (lock#2){+.+.?.}, at: [<e0861022>] shrink_test_count+0x12/0x60 [test]
> > > [   48.086745] {RECLAIM_FS-ON-W} state was registered at:
> > 
> > 
> > You have the scenario you described below, that is:
> > 
> > shrink_test_scan	
> > 	mutex_lock_killable()
> > 		-> kmalloc
> > 			-> shrink_test_count
> > 				mutex_lock_killable()
> > 
> > And 'mutex_lock_killable' is the same (in at least this context)
> > as 'mutex_lock'. In other words, your second 'mutex_lock'
> > is going to block - which is a deadlock.
> > 
> > Perhaps a way of not getting into this scenario is:
> > 
> >  1). Try to take the mutex (i.e., use a variant that won't block if it
> >      can't get it).
> > 
> >  2). Use GFP_ATOMIC in the shrinker so that we never
> >      end up calling ourselves in case of memory pressure
> > 
> > ?
> 
> Yes, I think so as well.
> 
> > > > > This patch changes "mutex_lock();" to "if (mutex_lock_killable()) return ...;"
> > > > > so that any thread can promptly give up. (By the way, as far as I tested,
> > > > > changing to "if (!mutex_trylock()) return ...;" likely shortens the duration
> > > > > of the stall. Maybe we don't need to wait for the mutex if someone is already
> > > > > calling these functions.)
> > > > > 
> > > > 
> > > > While discussing an XFS problem, a question came to mind. Is it OK (from
> > > > the point of view of reentrancy) to use mutex_lock() or
> > > > mutex_lock_killable() inside a shrinker's entry point functions? Is the
> > > > scenario shown below possible?
> > > > 
> > > > (1) kswapd is doing memory reclaim which does not need to hold the mutex.
> > > > 
> > > > (2) Someone in GFP_KERNEL context (not kswapd) calls
> > > >     ttm_dma_pool_shrink_count() and then calls ttm_dma_pool_shrink_scan()
> > > >     from the direct reclaim path.
> > > > 
> > > > (3) Inside ttm_dma_pool_shrink_scan(), a GFP_KERNEL allocation is issued
> > > >     while the mutex is held by that thread.
> > > > 
> > > > (4) The GFP_KERNEL allocation cannot be completed immediately due to
> > > >     memory pressure.
> > > > 
> > > > (5) kswapd calls ttm_dma_pool_shrink_count(), which needs to hold the
> > > >     mutex.
> > > > 
> > > > (6) Inside ttm_dma_pool_shrink_count(), kswapd is blocked waiting for the
> > > >     mutex held by that thread, and that thread is waiting for the
> > > >     GFP_KERNEL allocation to complete, but the GFP_KERNEL allocation
> > > >     cannot be completed until the mutex held by that thread is released?
> > 
> > Ewww. Perhaps if we used GFP_ATOMIC for the array allocation we do in
> > ttm_dma_page_pool_free and ttm_page_pool_free?
> > 
> > That would avoid the 4) problem.
> 
> Right. Which approach ("use GFP_ATOMIC or GFP_NOWAIT" / "use !mutex_trylock()")
> do you prefer? I'll create an RHBZ entry for the RHEL7 kernel, as the
> non-count/scan version has the same problem.

I am not sure why you need an RHBZ - patches that go upstream
don't need one.

I think the combination of mutex_trylock and GFP_ATOMIC would suffice.

Thank you!
> 
> ---------- test.c start ----------
> #include <linux/module.h>
> #include <linux/sched.h>
> #include <linux/slab.h>
> #include <linux/mm.h>
> 
> static int shrink_test(struct shrinker *shrinker, struct shrink_control *sc)
> {
> 	static DEFINE_MUTEX(lock);
> 	LIST_HEAD(list);
> 	int i = 0;
> 	if (mutex_lock_killable(&lock)) {
> 		printk(KERN_WARNING "Process %u (%s) gave up waiting for mutex"
> 		       "\n", current->pid, current->comm);
> 		return 0;
> 	}
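> 	/*
> 	 * Allocate until kmalloc() fails: with a GFP_KERNEL sc->gfp_mask
> 	 * this forces reclaim, and hence this shrinker, to be re-entered
> 	 * while the mutex above is still held.
> 	 */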
> 	while (1) {
> 		struct list_head *l = kmalloc(PAGE_SIZE, sc->gfp_mask);
> 		if (!l)
> 			break;
> 		list_add_tail(l, &list);
> 		i++;
> 	}
> 	printk(KERN_WARNING "Process %u (%s) allocated %u pages\n",
> 	       current->pid, current->comm, i);
> 	while (i--) {
> 		struct list_head *l = list.next;
> 		list_del(l);
> 		kfree(l);
> 	}
> 	mutex_unlock(&lock);
> 	return 0;
> }
> 
> static struct shrinker recursive_shrinker = {
> 	.shrink = shrink_test,
> 	.seeks = DEFAULT_SEEKS,
> };
> 
> static int __init recursive_shrinker_init(void)
> {
> 	register_shrinker(&recursive_shrinker);
> 	return 0;
> }
> 
> module_init(recursive_shrinker_init);
> 
> /* Unregister on unload so the VM does not call into freed module text. */
> static void __exit recursive_shrinker_exit(void)
> {
> 	unregister_shrinker(&recursive_shrinker);
> }
> module_exit(recursive_shrinker_exit);
> MODULE_LICENSE("GPL");
> ---------- test.c end ----------
> 
> [ 1263.179725] 
> [ 1263.180756] =================================
> [ 1263.182322] [ INFO: inconsistent lock state ]
> [ 1263.183920] 3.10.0-121.el7.x86_64.debug #1 Tainted: GF          O--------------  
> [ 1263.186162] ---------------------------------
> [ 1263.187742] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
> [ 1263.189788] kswapd0/105 [HC0[0]:SC0[0]:HE1:SE1] takes:
> [ 1263.191523]  (lock#3){+.+.?.}, at: [<ffffffffa0563040>] shrink_test+0x40/0x140 [test]
> [ 1263.194053] {RECLAIM_FS-ON-W} state was registered at:
> [ 1263.195848]   [<ffffffff810ea759>] mark_held_locks+0xb9/0x140
> [ 1263.197758]   [<ffffffff810ecb6a>] lockdep_trace_alloc+0x7a/0xe0
> [ 1263.199718]   [<ffffffff811db9d3>] kmem_cache_alloc_trace+0x33/0x340
> [ 1263.201809]   [<ffffffffa0563061>] shrink_test+0x61/0x140 [test]
> [ 1263.203662]   [<ffffffff81194a99>] shrink_slab+0xb9/0x4d0
> [ 1263.205378]   [<ffffffff81265403>] drop_caches_sysctl_handler+0xc3/0x120
> [ 1263.207352]   [<ffffffff8127dab4>] proc_sys_call_handler+0xe4/0x110
> [ 1263.209238]   [<ffffffff8127daf4>] proc_sys_write+0x14/0x20
> [ 1263.210972]   [<ffffffff811fd1a0>] vfs_write+0xc0/0x1f0
> [ 1263.212658]   [<ffffffff811fdc1b>] SyS_write+0x5b/0xb0
> [ 1263.214301]   [<ffffffff816bd899>] system_call_fastpath+0x16/0x1b
> [ 1263.216172] irq event stamp: 37
> [ 1263.217406] hardirqs last  enabled at (37): [<ffffffff816b2f9c>] _raw_spin_unlock_irq+0x2c/0x50
> [ 1263.219753] hardirqs last disabled at (36): [<ffffffff816b2dff>] _raw_spin_lock_irq+0x1f/0x90
> [ 1263.222052] softirqs last  enabled at (0): [<ffffffff8106aa25>] copy_process.part.22+0x665/0x1750
> [ 1263.224414] softirqs last disabled at (0): [<          (null)>]           (null)
> [ 1263.226492] 
> [ 1263.226492] other info that might help us debug this:
> [ 1263.228920]  Possible unsafe locking scenario:
> [ 1263.228920] 
> [ 1263.231192]        CPU0
> [ 1263.232223]        ----
> [ 1263.233280]   lock(lock#3);
> [ 1263.234435]   <Interrupt>
> [ 1263.235489]     lock(lock#3);
> [ 1263.236708] 
> [ 1263.236708]  *** DEADLOCK ***
> [ 1263.236708] 
> [ 1263.239358] 1 lock held by kswapd0/105:
> [ 1263.240593]  #0:  (shrinker_rwsem){++++.+}, at: [<ffffffff81194a1c>] shrink_slab+0x3c/0x4d0
> [ 1263.242894] 
> [ 1263.242894] stack backtrace:
> [ 1263.244792] CPU: 1 PID: 105 Comm: kswapd0 Tainted: GF          O--------------   3.10.0-121.el7.x86_64.debug #1
> [ 1263.247230] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
> [ 1263.249747]  ffff880036708000 000000004c6ef89a ffff8800367039c8 ffffffff816a981c
> [ 1263.251849]  ffff880036703a18 ffffffff816a3ac5 0000000000000000 ffff880000000001
> [ 1263.253956]  ffffffff00000001 000000000000000a ffff880036708000 ffffffff810e88a0
> [ 1263.256314] Call Trace:
> [ 1263.257365]  [<ffffffff816a981c>] dump_stack+0x19/0x1b
> [ 1263.258921]  [<ffffffff816a3ac5>] print_usage_bug+0x1f7/0x208
> [ 1263.260591]  [<ffffffff810e88a0>] ? check_usage_backwards+0x1b0/0x1b0
> [ 1263.262379]  [<ffffffff810ea61d>] mark_lock+0x21d/0x2a0
> [ 1263.263898]  [<ffffffff810eb30a>] __lock_acquire+0x52a/0xb60
> [ 1263.265562]  [<ffffffff810232c9>] ? sched_clock+0x9/0x10
> [ 1263.267148]  [<ffffffff810b7c75>] ? sched_clock_cpu+0xb5/0x100
> [ 1263.268802]  [<ffffffff810ec132>] lock_acquire+0xa2/0x1f0
> [ 1263.270378]  [<ffffffffa0563040>] ? shrink_test+0x40/0x140 [test]
> [ 1263.272072]  [<ffffffff816ae859>] mutex_lock_killable_nested+0x99/0x5d0
> [ 1263.273900]  [<ffffffffa0563040>] ? shrink_test+0x40/0x140 [test]
> [ 1263.275610]  [<ffffffffa0563040>] ? shrink_test+0x40/0x140 [test]
> [ 1263.277305]  [<ffffffffa0563040>] shrink_test+0x40/0x140 [test]
> [ 1263.278970]  [<ffffffff81194a99>] shrink_slab+0xb9/0x4d0
> [ 1263.280501]  [<ffffffff811991b9>] balance_pgdat+0x4e9/0x620
> [ 1263.282135]  [<ffffffff811994a3>] kswapd+0x1b3/0x640
> [ 1263.283604]  [<ffffffff8109f3c0>] ? wake_up_bit+0x30/0x30
> [ 1263.285166]  [<ffffffff811992f0>] ? balance_pgdat+0x620/0x620
> [ 1263.286798]  [<ffffffff8109e0cd>] kthread+0xed/0x100
> [ 1263.288286]  [<ffffffff8109dfe0>] ? insert_kthread_work+0x80/0x80
> [ 1263.289973]  [<ffffffff816bd7ec>] ret_from_fork+0x7c/0xb0
> [ 1263.291535]  [<ffffffff8109dfe0>] ? insert_kthread_work+0x80/0x80

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] gpu/drm/ttm: Use mutex_lock_killable() for shrinker functions.
  2014-05-29 14:34           ` Tetsuo Handa
@ 2014-05-30 16:08             ` Konrad Rzeszutek Wilk
  2014-05-31  2:58               ` [PATCH 1/5] gpu/drm/ttm: Fix possible division by 0 in ttm_dma_pool_shrink_scan() Tetsuo Handa
  0 siblings, 1 reply; 21+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-05-30 16:08 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: dchinner, airlied, glommer, mgorman, linux-mm, linux-kernel, dri-devel

On Thu, May 29, 2014 at 11:34:59PM +0900, Tetsuo Handa wrote:
> Tetsuo Handa wrote:
> > Konrad Rzeszutek Wilk wrote:
> > > On Sat, May 24, 2014 at 11:22:09PM +0900, Tetsuo Handa wrote:
> > > > Hello.
> > > > 
> > > > I tried to test whether it is OK (from the point of view of reentrancy)
> > > > to use mutex_lock() or mutex_lock_killable() inside shrinker functions
> > > > when those functions do memory allocation, since
> > > > drivers/gpu/drm/ttm/ttm_page_alloc_dma.c does memory allocation with the
> > > > mutex held inside ttm_dma_pool_shrink_scan().
> > > > 
> > > > If I compile the test module shown below, which mimics an extreme case
> > > > of what ttm_dma_pool_shrink_scan() will do
> > > 
> > > And ttm_pool_shrink_scan.
> > 
> > I don't know why, but ttm_pool_shrink_scan() does not take the mutex.
> > 
> Well, it seems to me that ttm_pool_shrink_scan() not taking the mutex is a
> bug which could lead to stack overflow if the kmalloc() in
> ttm_page_pool_free() triggered recursion.
> 
>   shrink_slab()
>   => ttm_pool_shrink_scan()
>      => ttm_page_pool_free()
>         => kmalloc(GFP_KERNEL)
>            => shrink_slab()
>               => ttm_pool_shrink_scan()
>                  => ttm_page_pool_free()
>                     => kmalloc(GFP_KERNEL)
> 
> Maybe shrink_slab() should be updated not to call the same shrinker in parallel?
> 
> Also, it seems to me that ttm_dma_pool_shrink_scan() has a potential
> division-by-0 bug, as described below. Is this patch correct?

Looks OK. I would need to test it first. Could you send both patches
to me please so I can just test them and queue them up together?

Thank you!
> ----------
> >From 4a65744a300e14e5e202c5f13ba2759e1e797d29 Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Date: Thu, 29 May 2014 18:25:42 +0900
> Subject: [PATCH] gpu/drm/ttm: Use mutex_trylock() for shrinker functions.
> 
> I can observe that RHEL7 environment stalls with 100% CPU usage when a
> certain type of memory pressure is given. While the shrinker functions
> are called by shrink_slab() before the OOM killer is triggered, the stall
> lasts for many minutes.
> 
> One of the reasons for this stall is that
> ttm_dma_pool_shrink_count()/ttm_dma_pool_shrink_scan() are called and
> are blocked at mutex_lock(&_manager->lock). A GFP_KERNEL allocation with
> _manager->lock held causes someone (including kswapd) to deadlock when
> these functions are called due to memory pressure. This patch changes
> "mutex_lock();" to "if (!mutex_trylock()) return ...;" in order to
> avoid the deadlock.
> 
> At the same time, this patch fixes a potential division by 0 caused by
> unconditionally doing "% _manager->npools": list_empty(&_manager->pools)
> being false does not guarantee that _manager->npools != 0 after taking
> _manager->lock, because _manager->npools is updated under _manager->lock.
> 
> This patch also moves the update of the start_pool variable in order to
> avoid skipping pools when choosing a pool to shrink in round-robin style.
> start_pool is changed from "atomic_t" to "unsigned int" because it is now
> updated under _manager->lock.
> 
> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Cc: stable <stable@kernel.org> [3.3+]
> ---
>  drivers/gpu/drm/ttm/ttm_page_alloc_dma.c | 15 ++++++++++-----
>  1 file changed, 10 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c b/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c
> index fb8259f..5e332b4 100644
> --- a/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c
> +++ b/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c
> @@ -1004,9 +1004,9 @@ EXPORT_SYMBOL_GPL(ttm_dma_unpopulate);
>  static unsigned long
>  ttm_dma_pool_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
>  {
> -	static atomic_t start_pool = ATOMIC_INIT(0);
> +	static unsigned int start_pool;
>  	unsigned idx = 0;
> -	unsigned pool_offset = atomic_add_return(1, &start_pool);
> +	unsigned pool_offset;
>  	unsigned shrink_pages = sc->nr_to_scan;
>  	struct device_pools *p;
>  	unsigned long freed = 0;
> @@ -1014,8 +1014,11 @@ ttm_dma_pool_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
>  	if (list_empty(&_manager->pools))
>  		return SHRINK_STOP;
>  
> -	mutex_lock(&_manager->lock);
> -	pool_offset = pool_offset % _manager->npools;
> +	if (!mutex_trylock(&_manager->lock))
> +		return SHRINK_STOP;
> +	if (!_manager->npools)
> +		goto out;
> +	pool_offset = ++start_pool % _manager->npools;
>  	list_for_each_entry(p, &_manager->pools, pools) {
>  		unsigned nr_free;
>  
> @@ -1034,6 +1037,7 @@ ttm_dma_pool_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
>  			 p->pool->dev_name, p->pool->name, current->pid,
>  			 nr_free, shrink_pages);
>  	}
> +out:
>  	mutex_unlock(&_manager->lock);
>  	return freed;
>  }
> @@ -1044,7 +1048,8 @@ ttm_dma_pool_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
>  	struct device_pools *p;
>  	unsigned long count = 0;
>  
> -	mutex_lock(&_manager->lock);
> +	if (!mutex_trylock(&_manager->lock))
> +		return 0;
>  	list_for_each_entry(p, &_manager->pools, pools)
>  		count += p->pool->npages_free;
>  	mutex_unlock(&_manager->lock);
> -- 
> 1.8.3.1

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 1/5] gpu/drm/ttm: Fix possible division by 0 in ttm_dma_pool_shrink_scan().
  2014-05-30 16:08             ` Konrad Rzeszutek Wilk
@ 2014-05-31  2:58               ` Tetsuo Handa
  2014-05-31  2:59                 ` [PATCH 2/5] gpu/drm/ttm: Choose a pool to shrink correctly " Tetsuo Handa
  0 siblings, 1 reply; 21+ messages in thread
From: Tetsuo Handa @ 2014-05-31  2:58 UTC (permalink / raw)
  To: konrad.wilk
  Cc: dchinner, airlied, glommer, mgorman, linux-mm, linux-kernel, dri-devel

>From c1af6a76f8566eeeed049d3cf24635a43b4a83a6 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Sat, 31 May 2014 09:39:22 +0900
Subject: [PATCH 1/5] gpu/drm/ttm: Fix possible division by 0 in ttm_dma_pool_shrink_scan().

list_empty(&_manager->pools) being false before taking _manager->lock
does not guarantee that _manager->npools != 0 after taking _manager->lock
because _manager->npools is updated under _manager->lock.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: stable <stable@kernel.org> [3.3+]
---
 drivers/gpu/drm/ttm/ttm_page_alloc_dma.c |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c b/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c
index fb8259f..b751fff 100644
--- a/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c
+++ b/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c
@@ -1015,6 +1015,8 @@ ttm_dma_pool_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
 		return SHRINK_STOP;
 
 	mutex_lock(&_manager->lock);
+	if (!_manager->npools)
+		goto out;
 	pool_offset = pool_offset % _manager->npools;
 	list_for_each_entry(p, &_manager->pools, pools) {
 		unsigned nr_free;
@@ -1034,6 +1036,7 @@ ttm_dma_pool_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
 			 p->pool->dev_name, p->pool->name, current->pid,
 			 nr_free, shrink_pages);
 	}
+out:
 	mutex_unlock(&_manager->lock);
 	return freed;
 }
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 21+ messages in thread
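
The race this closes can be sketched as an annotated flow of the pre-patch
code (illustration, not a literal quote of the function):

	if (list_empty(&_manager->pools))	/* checked with no lock held */
		return SHRINK_STOP;
	/*
	 * Another thread can remove the last pool at this point, leaving
	 * _manager->npools == 0 by the time we get the lock.
	 */
	mutex_lock(&_manager->lock);
	pool_offset = pool_offset % _manager->npools;	/* division by 0 */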

* [PATCH 2/5] gpu/drm/ttm: Choose a pool to shrink correctly in ttm_dma_pool_shrink_scan().
  2014-05-31  2:58               ` [PATCH 1/5] gpu/drm/ttm: Fix possible division by 0 in ttm_dma_pool_shrink_scan() Tetsuo Handa
@ 2014-05-31  2:59                 ` Tetsuo Handa
  2014-05-31  3:00                   ` [PATCH 3/5] gpu/drm/ttm: Use mutex_trylock() to avoid deadlock inside shrinker functions Tetsuo Handa
  0 siblings, 1 reply; 21+ messages in thread
From: Tetsuo Handa @ 2014-05-31  2:59 UTC (permalink / raw)
  To: konrad.wilk
  Cc: dchinner, airlied, glommer, mgorman, linux-mm, linux-kernel, dri-devel

>From 19927a63c5d2dcda467373c31d810be42e40e190 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Sat, 31 May 2014 09:47:02 +0900
Subject: [PATCH 2/5] gpu/drm/ttm: Choose a pool to shrink correctly in ttm_dma_pool_shrink_scan().

We can use "unsigned int" instead of "atomic_t" by updating the start_pool
variable under _manager->lock. This patch makes it possible to avoid
skipping pools when choosing a pool to shrink in round-robin style, once the
next patch changes mutex_lock(_manager->lock) to !mutex_trylock(_manager->lock).

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: stable <stable@kernel.org> [3.3+]
---
 drivers/gpu/drm/ttm/ttm_page_alloc_dma.c |    6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c b/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c
index b751fff..d8e59f7 100644
--- a/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c
+++ b/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c
@@ -1004,9 +1004,9 @@ EXPORT_SYMBOL_GPL(ttm_dma_unpopulate);
 static unsigned long
 ttm_dma_pool_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
 {
-	static atomic_t start_pool = ATOMIC_INIT(0);
+	static unsigned start_pool;
 	unsigned idx = 0;
-	unsigned pool_offset = atomic_add_return(1, &start_pool);
+	unsigned pool_offset;
 	unsigned shrink_pages = sc->nr_to_scan;
 	struct device_pools *p;
 	unsigned long freed = 0;
@@ -1017,7 +1017,7 @@ ttm_dma_pool_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
 	mutex_lock(&_manager->lock);
 	if (!_manager->npools)
 		goto out;
-	pool_offset = pool_offset % _manager->npools;
+	pool_offset = ++start_pool % _manager->npools;
 	list_for_each_entry(p, &_manager->pools, pools) {
 		unsigned nr_free;
 
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 21+ messages in thread
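
The skipping this avoids is easiest to see by comparing the two forms
(abbreviated from the hunk above):

	/* before: the counter advances on every call, even if the scan
	 * later bails out, so the pool it pointed at gets skipped */
	pool_offset = atomic_add_return(1, &start_pool) % _manager->npools;

	/* after: start_pool advances only under _manager->lock, i.e.
	 * only when a scan is actually going to walk the pools */
	pool_offset = ++start_pool % _manager->npools;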

* [PATCH 3/5] gpu/drm/ttm: Use mutex_trylock() to avoid deadlock inside shrinker functions.
  2014-05-31  2:59                 ` [PATCH 2/5] gpu/drm/ttm: Choose a pool to shrink correctly " Tetsuo Handa
@ 2014-05-31  3:00                   ` Tetsuo Handa
  2014-05-31  3:01                     ` [PATCH 4/5] gpu/drm/ttm: Fix possible stack overflow by recursive shrinker calls Tetsuo Handa
  2014-06-10 19:17                     ` [PATCH 3/5] gpu/drm/ttm: Use mutex_trylock() to avoid deadlock inside shrinker functions Konrad Rzeszutek Wilk
  0 siblings, 2 replies; 21+ messages in thread
From: Tetsuo Handa @ 2014-05-31  3:00 UTC (permalink / raw)
  To: konrad.wilk
  Cc: dchinner, airlied, glommer, mgorman, linux-mm, linux-kernel, dri-devel

>From 4e8d1a83629c5966bfd401c5f2187355624194f2 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Sat, 31 May 2014 09:59:44 +0900
Subject: [PATCH 3/5] gpu/drm/ttm: Use mutex_trylock() to avoid deadlock inside shrinker functions.

I can observe that RHEL7 environment stalls with 100% CPU usage when a
certain type of memory pressure is given. While the shrinker functions
are called by shrink_slab() before the OOM killer is triggered, the stall
lasts for many minutes.

One of the reasons for this stall is that
ttm_dma_pool_shrink_count()/ttm_dma_pool_shrink_scan() are called and
are blocked at mutex_lock(&_manager->lock). A GFP_KERNEL allocation with
_manager->lock held causes someone (including kswapd) to deadlock when
these functions are called due to memory pressure. This patch changes
"mutex_lock();" to "if (!mutex_trylock()) return ...;" in order to
avoid the deadlock.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: stable <stable@kernel.org> [3.3+]
---
 drivers/gpu/drm/ttm/ttm_page_alloc_dma.c |    6 ++++--
 1 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c b/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c
index d8e59f7..620da39 100644
--- a/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c
+++ b/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c
@@ -1014,7 +1014,8 @@ ttm_dma_pool_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
 	if (list_empty(&_manager->pools))
 		return SHRINK_STOP;
 
-	mutex_lock(&_manager->lock);
+	if (!mutex_lock(&_manager->lock))
+		return SHRINK_STOP;
 	if (!_manager->npools)
 		goto out;
 	pool_offset = ++start_pool % _manager->npools;
@@ -1047,7 +1048,8 @@ ttm_dma_pool_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
 	struct device_pools *p;
 	unsigned long count = 0;
 
-	mutex_lock(&_manager->lock);
+	if (!mutex_trylock(&_manager->lock))
+		return 0;
 	list_for_each_entry(p, &_manager->pools, pools)
 		count += p->pool->npages_free;
 	mutex_unlock(&_manager->lock);
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 21+ messages in thread
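
A note on the two bail-out values used above: a scan callback may return
SHRINK_STOP to tell shrink_slab() to stop calling this shrinker for now,
while a count callback can only report a number of freeable objects, so
returning 0 is the way to say "nothing to do". The generic pattern, with a
hypothetical example_lock:

	static DEFINE_MUTEX(example_lock);

	static unsigned long
	example_count(struct shrinker *shrink, struct shrink_control *sc)
	{
		unsigned long count = 0;

		if (!mutex_trylock(&example_lock))
			return 0;	/* contended: report nothing to free */
		/* ... count freeable objects into 'count' ... */
		mutex_unlock(&example_lock);
		return count;
	}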

* [PATCH 4/5] gpu/drm/ttm: Fix possible stack overflow by recursive shrinker calls.
  2014-05-31  3:00                   ` [PATCH 3/5] gpu/drm/ttm: Use mutex_trylock() to avoid deadlock inside shrinker functions Tetsuo Handa
@ 2014-05-31  3:01                     ` Tetsuo Handa
  2014-05-31  3:02                       ` [PATCH 5/5] gpu/drm/ttm: Pass GFP flags in order to avoid deadlock Tetsuo Handa
  2014-06-10 19:17                     ` [PATCH 3/5] gpu/drm/ttm: Use mutex_trylock() to avoid deadlock inside shrinker functions Konrad Rzeszutek Wilk
  1 sibling, 1 reply; 21+ messages in thread
From: Tetsuo Handa @ 2014-05-31  3:01 UTC (permalink / raw)
  To: konrad.wilk
  Cc: dchinner, airlied, glommer, mgorman, linux-mm, linux-kernel, dri-devel

>From d960cdf1e1c91172b86ab9517e576e5fb7e71785 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Sat, 31 May 2014 10:05:02 +0900
Subject: [PATCH 4/5] gpu/drm/ttm: Fix possible stack overflow by recursive shrinker calls.

While ttm_dma_pool_shrink_scan() tries to take the mutex before doing a
GFP_KERNEL allocation, ttm_pool_shrink_scan() does not. This can result in a
stack overflow if the kmalloc() in ttm_page_pool_free() triggers recursion
due to memory pressure.

  shrink_slab()
  => ttm_pool_shrink_scan()
     => ttm_page_pool_free()
        => kmalloc(GFP_KERNEL)
           => shrink_slab()
              => ttm_pool_shrink_scan()
                 => ttm_page_pool_free()
                    => kmalloc(GFP_KERNEL)

Change ttm_pool_shrink_scan() to behave like ttm_dma_pool_shrink_scan().

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: stable <stable@kernel.org> [2.6.35+]
---
 drivers/gpu/drm/ttm/ttm_page_alloc.c |   10 +++++++---
 1 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/ttm/ttm_page_alloc.c b/drivers/gpu/drm/ttm/ttm_page_alloc.c
index 863bef9..deba59b 100644
--- a/drivers/gpu/drm/ttm/ttm_page_alloc.c
+++ b/drivers/gpu/drm/ttm/ttm_page_alloc.c
@@ -391,14 +391,17 @@ out:
 static unsigned long
 ttm_pool_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
 {
-	static atomic_t start_pool = ATOMIC_INIT(0);
+	static DEFINE_MUTEX(lock);
+	static unsigned start_pool;
 	unsigned i;
-	unsigned pool_offset = atomic_add_return(1, &start_pool);
+	unsigned pool_offset;
 	struct ttm_page_pool *pool;
 	int shrink_pages = sc->nr_to_scan;
 	unsigned long freed = 0;
 
-	pool_offset = pool_offset % NUM_POOLS;
+	if (!mutex_trylock(&lock))
+		return SHRINK_STOP;
+	pool_offset = ++start_pool % NUM_POOLS;
 	/* select start pool in round robin fashion */
 	for (i = 0; i < NUM_POOLS; ++i) {
 		unsigned nr_free = shrink_pages;
@@ -408,6 +411,7 @@ ttm_pool_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
 		shrink_pages = ttm_page_pool_free(pool, nr_free);
 		freed += nr_free - shrink_pages;
 	}
+	mutex_unlock(&lock);
 	return freed;
 }
 
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 21+ messages in thread
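
Note that the lock added here is a function-local static, not
_manager->lock: ttm_pool_shrink_scan() only has to keep from re-entering
(or racing with) itself, so a single mutex around the function body is
enough. In isolation the guard looks like this (sketch, hypothetical names):

	static DEFINE_MUTEX(scan_lock);

	static unsigned long
	example_scan(struct shrinker *shrink, struct shrink_control *sc)
	{
		unsigned long freed = 0;

		if (!mutex_trylock(&scan_lock))	/* recursive or concurrent scan */
			return SHRINK_STOP;
		/*
		 * Freeing may allocate and re-enter reclaim, but the inner
		 * invocation now fails the trylock above and bails out
		 * instead of growing the stack.
		 */
		mutex_unlock(&scan_lock);
		return freed;
	}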

* [PATCH 5/5] gpu/drm/ttm: Pass GFP flags in order to avoid deadlock.
  2014-05-31  3:01                     ` [PATCH 4/5] gpu/drm/ttm: Fix possible stack overflow by recursive shrinker calls Tetsuo Handa
@ 2014-05-31  3:02                       ` Tetsuo Handa
  0 siblings, 0 replies; 21+ messages in thread
From: Tetsuo Handa @ 2014-05-31  3:02 UTC (permalink / raw)
  To: konrad.wilk
  Cc: dchinner, airlied, glommer, mgorman, linux-mm, linux-kernel,
	dri-devel, penguin-kernel

>From 357d4ba36284eee4c2a99085450a9039735767c6 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Sat, 31 May 2014 10:22:17 +0900
Subject: [PATCH 5/5] gpu/drm/ttm: Pass GFP flags in order to avoid deadlock.

Commit 7dc19d5a "drivers: convert shrinkers to new count/scan API" added
deadlock warnings noting that ttm_page_pool_free() and
ttm_dma_page_pool_free() are currently doing GFP_KERNEL allocations.

But these functions did not get updated to receive a gfp_t argument.
This patch explicitly passes sc->gfp_mask or GFP_KERNEL to these functions,
and removes the deadlock warnings.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: stable <stable@kernel.org> [2.6.35+]
---
 drivers/gpu/drm/ttm/ttm_page_alloc.c     |   19 ++++++++++---------
 drivers/gpu/drm/ttm/ttm_page_alloc_dma.c |   19 +++++++++----------
 2 files changed, 19 insertions(+), 19 deletions(-)

diff --git a/drivers/gpu/drm/ttm/ttm_page_alloc.c b/drivers/gpu/drm/ttm/ttm_page_alloc.c
index deba59b..cf4bad2 100644
--- a/drivers/gpu/drm/ttm/ttm_page_alloc.c
+++ b/drivers/gpu/drm/ttm/ttm_page_alloc.c
@@ -297,8 +297,10 @@ static void ttm_pool_update_free_locked(struct ttm_page_pool *pool,
  *
  * @pool: to free the pages from
  * @free_all: If set to true will free all pages in pool
+ * @gfp: GFP flags.
  **/
-static int ttm_page_pool_free(struct ttm_page_pool *pool, unsigned nr_free)
+static int ttm_page_pool_free(struct ttm_page_pool *pool, unsigned nr_free,
+			      gfp_t gfp)
 {
 	unsigned long irq_flags;
 	struct page *p;
@@ -309,8 +311,7 @@ static int ttm_page_pool_free(struct ttm_page_pool *pool, unsigned nr_free)
 	if (NUM_PAGES_TO_ALLOC < nr_free)
 		npages_to_free = NUM_PAGES_TO_ALLOC;
 
-	pages_to_free = kmalloc(npages_to_free * sizeof(struct page *),
-			GFP_KERNEL);
+	pages_to_free = kmalloc(npages_to_free * sizeof(struct page *), gfp);
 	if (!pages_to_free) {
 		pr_err("Failed to allocate memory for pool free operation\n");
 		return 0;
@@ -382,9 +383,7 @@ out:
  *
  * XXX: (dchinner) Deadlock warning!
  *
- * ttm_page_pool_free() does memory allocation using GFP_KERNEL.  that means
- * this can deadlock when called a sc->gfp_mask that is not equal to
- * GFP_KERNEL.
+ * We need to pass sc->gfp_mask to ttm_page_pool_free().
  *
  * This code is crying out for a shrinker per pool....
  */
@@ -408,7 +407,8 @@ ttm_pool_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
 		if (shrink_pages == 0)
 			break;
 		pool = &_manager->pools[(i + pool_offset)%NUM_POOLS];
-		shrink_pages = ttm_page_pool_free(pool, nr_free);
+		shrink_pages = ttm_page_pool_free(pool, nr_free,
+						  sc->gfp_mask);
 		freed += nr_free - shrink_pages;
 	}
 	mutex_unlock(&lock);
@@ -710,7 +710,7 @@ static void ttm_put_pages(struct page **pages, unsigned npages, int flags,
 	}
 	spin_unlock_irqrestore(&pool->lock, irq_flags);
 	if (npages)
-		ttm_page_pool_free(pool, npages);
+		ttm_page_pool_free(pool, npages, GFP_KERNEL);
 }
 
 /*
@@ -850,7 +850,8 @@ void ttm_page_alloc_fini(void)
 	ttm_pool_mm_shrink_fini(_manager);
 
 	for (i = 0; i < NUM_POOLS; ++i)
-		ttm_page_pool_free(&_manager->pools[i], FREE_ALL_PAGES);
+		ttm_page_pool_free(&_manager->pools[i], FREE_ALL_PAGES,
+				   GFP_KERNEL);
 
 	kobject_put(&_manager->kobj);
 	_manager = NULL;
diff --git a/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c b/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c
index 620da39..d008cf4 100644
--- a/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c
+++ b/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c
@@ -411,8 +411,10 @@ static void ttm_dma_page_put(struct dma_pool *pool, struct dma_page *d_page)
  *
  * @pool: to free the pages from
  * @nr_free: If set to true will free all pages in pool
+ * @gfp: GFP flags.
  **/
-static unsigned ttm_dma_page_pool_free(struct dma_pool *pool, unsigned nr_free)
+static unsigned ttm_dma_page_pool_free(struct dma_pool *pool, unsigned nr_free,
+				       gfp_t gfp)
 {
 	unsigned long irq_flags;
 	struct dma_page *dma_p, *tmp;
@@ -430,8 +432,7 @@ static unsigned ttm_dma_page_pool_free(struct dma_pool *pool, unsigned nr_free)
 			 npages_to_free, nr_free);
 	}
 #endif
-	pages_to_free = kmalloc(npages_to_free * sizeof(struct page *),
-			GFP_KERNEL);
+	pages_to_free = kmalloc(npages_to_free * sizeof(struct page *), gfp);
 
 	if (!pages_to_free) {
 		pr_err("%s: Failed to allocate memory for pool free operation\n",
@@ -530,7 +531,7 @@ static void ttm_dma_free_pool(struct device *dev, enum pool_type type)
 		if (pool->type != type)
 			continue;
 		/* Takes a spinlock.. */
-		ttm_dma_page_pool_free(pool, FREE_ALL_PAGES);
+		ttm_dma_page_pool_free(pool, FREE_ALL_PAGES, GFP_KERNEL);
 		WARN_ON(((pool->npages_in_use + pool->npages_free) != 0));
 		/* This code path is called after _all_ references to the
 		 * struct device has been dropped - so nobody should be
@@ -983,7 +984,7 @@ void ttm_dma_unpopulate(struct ttm_dma_tt *ttm_dma, struct device *dev)
 
 	/* shrink pool if necessary (only on !is_cached pools)*/
 	if (npages)
-		ttm_dma_page_pool_free(pool, npages);
+		ttm_dma_page_pool_free(pool, npages, GFP_KERNEL);
 	ttm->state = tt_unpopulated;
 }
 EXPORT_SYMBOL_GPL(ttm_dma_unpopulate);
@@ -993,10 +994,7 @@ EXPORT_SYMBOL_GPL(ttm_dma_unpopulate);
  *
  * XXX: (dchinner) Deadlock warning!
  *
- * ttm_dma_page_pool_free() does GFP_KERNEL memory allocation, and so attention
- * needs to be paid to sc->gfp_mask to determine if this can be done or not.
- * GFP_KERNEL memory allocation in a GFP_ATOMIC reclaim context woul dbe really
- * bad.
+ * We need to pass sc->gfp_mask to ttm_dma_page_pool_free().
  *
  * I'm getting sadder as I hear more pathetical whimpers about needing per-pool
  * shrinkers
@@ -1030,7 +1028,8 @@ ttm_dma_pool_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
 		if (++idx < pool_offset)
 			continue;
 		nr_free = shrink_pages;
-		shrink_pages = ttm_dma_page_pool_free(p->pool, nr_free);
+		shrink_pages = ttm_dma_page_pool_free(p->pool, nr_free,
+						      sc->gfp_mask);
 		freed += nr_free - shrink_pages;
 
 		pr_debug("%s: (%s:%d) Asked to shrink %d, have %d more to go\n",
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 21+ messages in thread
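
With the gfp_t plumbed through, the pool-free allocation inherits the
constraints of whatever allocation triggered reclaim, e.g. (sketch of the
post-patch call chain):

  shrink_slab()  /* entered with some sc->gfp_mask, e.g. GFP_NOFS */
  => ttm_pool_shrink_scan()
     => ttm_page_pool_free(pool, nr_free, sc->gfp_mask)
        => kmalloc(..., sc->gfp_mask)  /* same constraints as the caller */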

* Re: [PATCH 3/5] gpu/drm/ttm: Use mutex_trylock() to avoid deadlock inside shrinker functions.
  2014-05-31  3:00                   ` [PATCH 3/5] gpu/drm/ttm: Use mutex_trylock() to avoid deadlock inside shrinker functions Tetsuo Handa
  2014-05-31  3:01                     ` [PATCH 4/5] gpu/drm/ttm: Fix possible stack overflow by recursive shrinker calls Tetsuo Handa
@ 2014-06-10 19:17                     ` Konrad Rzeszutek Wilk
  2014-06-10 20:16                       ` Tetsuo Handa
  1 sibling, 1 reply; 21+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-06-10 19:17 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: dchinner, airlied, glommer, mgorman, linux-mm, linux-kernel, dri-devel

On Sat, May 31, 2014 at 12:00:45PM +0900, Tetsuo Handa wrote:
> >From 4e8d1a83629c5966bfd401c5f2187355624194f2 Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Date: Sat, 31 May 2014 09:59:44 +0900
> Subject: [PATCH 3/5] gpu/drm/ttm: Use mutex_trylock() to avoid deadlock inside shrinker functions.
> 
> I can observe that RHEL7 environment stalls with 100% CPU usage when a
> certain type of memory pressure is given. While the shrinker functions
> are called by shrink_slab() before the OOM killer is triggered, the stall
> lasts for many minutes.
> 
> One of the reasons for this stall is that
> ttm_dma_pool_shrink_count()/ttm_dma_pool_shrink_scan() are called and
> are blocked at mutex_lock(&_manager->lock). A GFP_KERNEL allocation with
> _manager->lock held causes someone (including kswapd) to deadlock when
> these functions are called due to memory pressure. This patch changes
> "mutex_lock();" to "if (!mutex_trylock()) return ...;" in order to
> avoid the deadlock.
> 
> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Cc: stable <stable@kernel.org> [3.3+]
> ---
>  drivers/gpu/drm/ttm/ttm_page_alloc_dma.c |    6 ++++--
>  1 files changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c b/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c
> index d8e59f7..620da39 100644
> --- a/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c
> +++ b/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c
> @@ -1014,7 +1014,8 @@ ttm_dma_pool_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
>  	if (list_empty(&_manager->pools))
>  		return SHRINK_STOP;
>  
> -	mutex_lock(&_manager->lock);
> +	if (!mutex_lock(&_manager->lock))
> +		return SHRINK_STOP;

Hmm..

/home/konrad/linux/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c: In function ‘ttm_dma_pool_shrink_scan’:
/home/konrad/linux/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c:1015:2: error: invalid use of void expression
  if (!mutex_lock(&_manager->lock))

This is based on v3.15 with these patches.

>  	if (!_manager->npools)
>  		goto out;
>  	pool_offset = ++start_pool % _manager->npools;
> @@ -1047,7 +1048,8 @@ ttm_dma_pool_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
>  	struct device_pools *p;
>  	unsigned long count = 0;
>  
> -	mutex_lock(&_manager->lock);
> +	if (!mutex_trylock(&_manager->lock))
> +		return 0;
>  	list_for_each_entry(p, &_manager->pools, pools)
>  		count += p->pool->npages_free;
>  	mutex_unlock(&_manager->lock);
> -- 
> 1.7.1

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/5] gpu/drm/ttm: Use mutex_trylock() to avoid deadlock inside shrinker functions.
  2014-06-10 19:17                     ` [PATCH 3/5] gpu/drm/ttm: Use mutex_trylock() to avoid deadlock inside shrinker functions Konrad Rzeszutek Wilk
@ 2014-06-10 20:16                       ` Tetsuo Handa
  2014-08-03 11:14                         ` [PATCH 1/5] gpu/drm/ttm: Fix possible division by 0 in ttm_dma_pool_shrink_scan() Tetsuo Handa
  0 siblings, 1 reply; 21+ messages in thread
From: Tetsuo Handa @ 2014-06-10 20:16 UTC (permalink / raw)
  To: konrad.wilk
  Cc: dchinner, airlied, glommer, mgorman, linux-mm, linux-kernel, dri-devel

Konrad Rzeszutek Wilk wrote:
> Hmm..
> 
> /home/konrad/linux/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c: In function ‘ttm_dma_pool_shrink_scan’:
> /home/konrad/linux/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c:1015:2: error: invalid use of void expression
>   if (!mutex_lock(&_manager->lock))
> 
> This is based on v3.15 with these patches.

Wow! I didn't know that my gcc does not emit a warning for such a mistake.
Thank you for catching this.
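
For reference, the three locking calls differ in return type, which is what
the compiler is complaining about:

	void mutex_lock(struct mutex *lock);		/* no value to test */
	int  mutex_lock_killable(struct mutex *lock);	/* 0, or -EINTR on a fatal signal */
	int  mutex_trylock(struct mutex *lock);		/* 1 if acquired, 0 if contended */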
----------
>From 6e6774a87695408ef077cab576e76f7fa2cf4355 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Wed, 11 Jun 2014 05:10:50 +0900
Subject: [PATCH 3/5 (v2)] gpu/drm/ttm: Use mutex_trylock() to avoid deadlock inside shrinker functions.

I can observe that RHEL7 environment stalls with 100% CPU usage when a
certain type of memory pressure is given. While the shrinker functions
are called by shrink_slab() before the OOM killer is triggered, the stall
lasts for many minutes.

One of the reasons for this stall is that
ttm_dma_pool_shrink_count()/ttm_dma_pool_shrink_scan() are called and
are blocked at mutex_lock(&_manager->lock). A GFP_KERNEL allocation with
_manager->lock held causes someone (including kswapd) to deadlock when
these functions are called due to memory pressure. This patch changes
"mutex_lock();" to "if (!mutex_trylock()) return ...;" in order to
avoid the deadlock.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: stable <stable@kernel.org> [3.3+]
---
 drivers/gpu/drm/ttm/ttm_page_alloc_dma.c |    6 ++++--
 1 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c b/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c
index d8e59f7..524cc1a 100644
--- a/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c
+++ b/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c
@@ -1014,7 +1014,8 @@ ttm_dma_pool_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
 	if (list_empty(&_manager->pools))
 		return SHRINK_STOP;
 
-	mutex_lock(&_manager->lock);
+	if (!mutex_trylock(&_manager->lock))
+		return SHRINK_STOP;
 	if (!_manager->npools)
 		goto out;
 	pool_offset = ++start_pool % _manager->npools;
@@ -1047,7 +1048,8 @@ ttm_dma_pool_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
 	struct device_pools *p;
 	unsigned long count = 0;
 
-	mutex_lock(&_manager->lock);
+	if (!mutex_trylock(&_manager->lock))
+		return 0;
 	list_for_each_entry(p, &_manager->pools, pools)
 		count += p->pool->npages_free;
 	mutex_unlock(&_manager->lock);
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 1/5] gpu/drm/ttm: Fix possible division by 0 in ttm_dma_pool_shrink_scan().
  2014-06-10 20:16                       ` Tetsuo Handa
@ 2014-08-03 11:14                         ` Tetsuo Handa
  2014-08-03 11:14                           ` [PATCH 2/5] gpu/drm/ttm: Choose a pool to shrink correctly " Tetsuo Handa
  0 siblings, 1 reply; 21+ messages in thread
From: Tetsuo Handa @ 2014-08-03 11:14 UTC (permalink / raw)
  To: konrad.wilk, dchinner; +Cc: airlied, linux-kernel, dri-devel

>From 5c2f18ca300a1182e40f143b81e927426232b005 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Sun, 3 Aug 2014 19:59:35 +0900
Subject: [PATCH 1/5] gpu/drm/ttm: Fix possible division by 0 in ttm_dma_pool_shrink_scan().

list_empty(&_manager->pools) being false before taking _manager->lock
does not guarantee that _manager->npools != 0 after taking _manager->lock
because _manager->npools is updated under _manager->lock.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: stable <stable@kernel.org> [3.3+]
---
 drivers/gpu/drm/ttm/ttm_page_alloc_dma.c |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c b/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c
index fb8259f..b751fff 100644
--- a/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c
+++ b/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c
@@ -1015,6 +1015,8 @@ ttm_dma_pool_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
 		return SHRINK_STOP;
 
 	mutex_lock(&_manager->lock);
+	if (!_manager->npools)
+		goto out;
 	pool_offset = pool_offset % _manager->npools;
 	list_for_each_entry(p, &_manager->pools, pools) {
 		unsigned nr_free;
@@ -1034,6 +1036,7 @@ ttm_dma_pool_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
 			 p->pool->dev_name, p->pool->name, current->pid,
 			 nr_free, shrink_pages);
 	}
+out:
 	mutex_unlock(&_manager->lock);
 	return freed;
 }
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 2/5] gpu/drm/ttm: Choose a pool to shrink correctly in ttm_dma_pool_shrink_scan().
  2014-08-03 11:14                         ` [PATCH 1/5] gpu/drm/ttm: Fix possible division by 0 in ttm_dma_pool_shrink_scan() Tetsuo Handa
@ 2014-08-03 11:14                           ` Tetsuo Handa
  2014-08-03 11:15                             ` [PATCH 3/5] gpu/drm/ttm: Use mutex_trylock() to avoid deadlock inside shrinker functions Tetsuo Handa
  0 siblings, 1 reply; 21+ messages in thread
From: Tetsuo Handa @ 2014-08-03 11:14 UTC (permalink / raw)
  To: konrad.wilk, dchinner; +Cc: airlied, linux-kernel, dri-devel

>From ee941e18a3c05a7589b690669611c3f0895a2d5a Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Sun, 3 Aug 2014 20:00:40 +0900
Subject: [PATCH 2/5] gpu/drm/ttm: Choose a pool to shrink correctly in ttm_dma_pool_shrink_scan().

We can use "unsigned int" instead of "atomic_t" by updating the start_pool
variable under _manager->lock. This patch makes it possible to avoid
skipping pools when choosing a pool to shrink in round-robin style, once the
next patch changes mutex_lock(_manager->lock) to !mutex_trylock(_manager->lock).

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: stable <stable@kernel.org> [3.3+]
---
 drivers/gpu/drm/ttm/ttm_page_alloc_dma.c |    6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c b/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c
index b751fff..d8e59f7 100644
--- a/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c
+++ b/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c
@@ -1004,9 +1004,9 @@ EXPORT_SYMBOL_GPL(ttm_dma_unpopulate);
 static unsigned long
 ttm_dma_pool_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
 {
-	static atomic_t start_pool = ATOMIC_INIT(0);
+	static unsigned start_pool;
 	unsigned idx = 0;
-	unsigned pool_offset = atomic_add_return(1, &start_pool);
+	unsigned pool_offset;
 	unsigned shrink_pages = sc->nr_to_scan;
 	struct device_pools *p;
 	unsigned long freed = 0;
@@ -1017,7 +1017,7 @@ ttm_dma_pool_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
 	mutex_lock(&_manager->lock);
 	if (!_manager->npools)
 		goto out;
-	pool_offset = pool_offset % _manager->npools;
+	pool_offset = ++start_pool % _manager->npools;
 	list_for_each_entry(p, &_manager->pools, pools) {
 		unsigned nr_free;
 
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 3/5] gpu/drm/ttm: Use mutex_trylock() to avoid deadlock inside shrinker functions.
  2014-08-03 11:14                           ` [PATCH 2/5] gpu/drm/ttm: Choose a pool to shrink correctly " Tetsuo Handa
@ 2014-08-03 11:15                             ` Tetsuo Handa
  2014-08-03 11:16                               ` [PATCH 4/5] gpu/drm/ttm: Fix possible stack overflow by recursive shrinker calls Tetsuo Handa
  0 siblings, 1 reply; 21+ messages in thread
From: Tetsuo Handa @ 2014-08-03 11:15 UTC (permalink / raw)
  To: konrad.wilk, dchinner; +Cc: airlied, linux-kernel, dri-devel

>From e945aa0d6518563835c3f279b9c7f9fc3f20b38b Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Sun, 3 Aug 2014 20:01:10 +0900
Subject: [PATCH 3/5] gpu/drm/ttm: Use mutex_trylock() to avoid deadlock inside shrinker functions.

I can observe that RHEL7 environment stalls with 100% CPU usage when a
certain type of memory pressure is given. While the shrinker functions
are called by shrink_slab() before the OOM killer is triggered, the stall
lasts for many minutes.

One of the reasons for this stall is that
ttm_dma_pool_shrink_count()/ttm_dma_pool_shrink_scan() are called and
are blocked at mutex_lock(&_manager->lock). A GFP_KERNEL allocation with
_manager->lock held causes someone (including kswapd) to deadlock when
these functions are called due to memory pressure. This patch changes
"mutex_lock();" to "if (!mutex_trylock()) return ...;" in order to
avoid the deadlock.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: stable <stable@kernel.org> [3.3+]
---
 drivers/gpu/drm/ttm/ttm_page_alloc_dma.c |    6 ++++--
 1 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c b/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c
index d8e59f7..524cc1a 100644
--- a/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c
+++ b/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c
@@ -1014,7 +1014,8 @@ ttm_dma_pool_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
 	if (list_empty(&_manager->pools))
 		return SHRINK_STOP;
 
-	mutex_lock(&_manager->lock);
+	if (!mutex_trylock(&_manager->lock))
+		return SHRINK_STOP;
 	if (!_manager->npools)
 		goto out;
 	pool_offset = ++start_pool % _manager->npools;
@@ -1047,7 +1048,8 @@ ttm_dma_pool_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
 	struct device_pools *p;
 	unsigned long count = 0;
 
-	mutex_lock(&_manager->lock);
+	if (!mutex_trylock(&_manager->lock))
+		return 0;
 	list_for_each_entry(p, &_manager->pools, pools)
 		count += p->pool->npages_free;
 	mutex_unlock(&_manager->lock);
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 4/5] gpu/drm/ttm: Fix possible stack overflow by recursive shrinker calls.
  2014-08-03 11:15                             ` [PATCH 3/5] gpu/drm/ttm: Use mutex_trylock() to avoid deadlock inside shrinker functions Tetsuo Handa
@ 2014-08-03 11:16                               ` Tetsuo Handa
  2014-08-03 11:16                                 ` [PATCH 5/5] gpu/drm/ttm: Pass GFP flags in order to avoid deadlock Tetsuo Handa
  0 siblings, 1 reply; 21+ messages in thread
From: Tetsuo Handa @ 2014-08-03 11:16 UTC (permalink / raw)
  To: konrad.wilk, dchinner; +Cc: airlied, linux-kernel, dri-devel

>From 16009d9def2c3087772e6c9dbec6c60950ae768b Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Sun, 3 Aug 2014 20:02:03 +0900
Subject: [PATCH 4/5] gpu/drm/ttm: Fix possible stack overflow by recursive shrinker calls.

While ttm_dma_pool_shrink_scan() tries to take the mutex before doing a
GFP_KERNEL allocation, ttm_pool_shrink_scan() does not. This can result in a
stack overflow if the kmalloc() in ttm_page_pool_free() triggers recursion
due to memory pressure.

  shrink_slab()
  => ttm_pool_shrink_scan()
     => ttm_page_pool_free()
        => kmalloc(GFP_KERNEL)
           => shrink_slab()
              => ttm_pool_shrink_scan()
                 => ttm_page_pool_free()
                    => kmalloc(GFP_KERNEL)

Change ttm_pool_shrink_scan() to behave like ttm_dma_pool_shrink_scan().

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: stable <stable@kernel.org> [2.6.35+]
---
 drivers/gpu/drm/ttm/ttm_page_alloc.c |   10 +++++++---
 1 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/ttm/ttm_page_alloc.c b/drivers/gpu/drm/ttm/ttm_page_alloc.c
index beb8e75..edb8315 100644
--- a/drivers/gpu/drm/ttm/ttm_page_alloc.c
+++ b/drivers/gpu/drm/ttm/ttm_page_alloc.c
@@ -391,14 +391,17 @@ out:
 static unsigned long
 ttm_pool_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
 {
-	static atomic_t start_pool = ATOMIC_INIT(0);
+	static DEFINE_MUTEX(lock);
+	static unsigned start_pool;
 	unsigned i;
-	unsigned pool_offset = atomic_add_return(1, &start_pool);
+	unsigned pool_offset;
 	struct ttm_page_pool *pool;
 	int shrink_pages = sc->nr_to_scan;
 	unsigned long freed = 0;
 
-	pool_offset = pool_offset % NUM_POOLS;
+	if (!mutex_trylock(&lock))
+		return SHRINK_STOP;
+	pool_offset = ++start_pool % NUM_POOLS;
 	/* select start pool in round robin fashion */
 	for (i = 0; i < NUM_POOLS; ++i) {
 		unsigned nr_free = shrink_pages;
@@ -408,6 +411,7 @@ ttm_pool_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
 		shrink_pages = ttm_page_pool_free(pool, nr_free);
 		freed += nr_free - shrink_pages;
 	}
+	mutex_unlock(&lock);
 	return freed;
 }
 
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 5/5] gpu/drm/ttm: Pass GFP flags in order to avoid deadlock.
  2014-08-03 11:16                               ` [PATCH 4/5] gpu/drm/ttm: Fix possible stack overflow by recursive shrinker calls Tetsuo Handa
@ 2014-08-03 11:16                                 ` Tetsuo Handa
  0 siblings, 0 replies; 21+ messages in thread
From: Tetsuo Handa @ 2014-08-03 11:16 UTC (permalink / raw)
  To: konrad.wilk, dchinner; +Cc: airlied, linux-kernel, dri-devel

>From cf572226bd1f67305dc5477fccdc6daa30396994 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Sun, 3 Aug 2014 20:02:31 +0900
Subject: [PATCH 5/5] gpu/drm/ttm: Pass GFP flags in order to avoid deadlock.

Commit 7dc19d5a "drivers: convert shrinkers to new count/scan API" added
deadlock warnings noting that ttm_page_pool_free() and
ttm_dma_page_pool_free() are currently doing GFP_KERNEL allocations.

But these functions did not get updated to receive a gfp_t argument.
This patch explicitly passes sc->gfp_mask or GFP_KERNEL to these functions,
and removes the deadlock warnings.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: stable <stable@kernel.org> [2.6.35+]
---
 drivers/gpu/drm/ttm/ttm_page_alloc.c     |   19 ++++++++++---------
 drivers/gpu/drm/ttm/ttm_page_alloc_dma.c |   19 +++++++++----------
 2 files changed, 19 insertions(+), 19 deletions(-)

diff --git a/drivers/gpu/drm/ttm/ttm_page_alloc.c b/drivers/gpu/drm/ttm/ttm_page_alloc.c
index edb8315..09874d6 100644
--- a/drivers/gpu/drm/ttm/ttm_page_alloc.c
+++ b/drivers/gpu/drm/ttm/ttm_page_alloc.c
@@ -297,8 +297,10 @@ static void ttm_pool_update_free_locked(struct ttm_page_pool *pool,
  *
  * @pool: to free the pages from
  * @free_all: If set to true will free all pages in pool
+ * @gfp: GFP flags.
  **/
-static int ttm_page_pool_free(struct ttm_page_pool *pool, unsigned nr_free)
+static int ttm_page_pool_free(struct ttm_page_pool *pool, unsigned nr_free,
+			      gfp_t gfp)
 {
 	unsigned long irq_flags;
 	struct page *p;
@@ -309,8 +311,7 @@ static int ttm_page_pool_free(struct ttm_page_pool *pool, unsigned nr_free)
 	if (NUM_PAGES_TO_ALLOC < nr_free)
 		npages_to_free = NUM_PAGES_TO_ALLOC;
 
-	pages_to_free = kmalloc(npages_to_free * sizeof(struct page *),
-			GFP_KERNEL);
+	pages_to_free = kmalloc(npages_to_free * sizeof(struct page *), gfp);
 	if (!pages_to_free) {
 		pr_err("Failed to allocate memory for pool free operation\n");
 		return 0;
@@ -382,9 +383,7 @@ out:
  *
  * XXX: (dchinner) Deadlock warning!
  *
- * ttm_page_pool_free() does memory allocation using GFP_KERNEL.  that means
- * this can deadlock when called a sc->gfp_mask that is not equal to
- * GFP_KERNEL.
+ * We need to pass sc->gfp_mask to ttm_page_pool_free().
  *
  * This code is crying out for a shrinker per pool....
  */
@@ -408,7 +407,8 @@ ttm_pool_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
 		if (shrink_pages == 0)
 			break;
 		pool = &_manager->pools[(i + pool_offset)%NUM_POOLS];
-		shrink_pages = ttm_page_pool_free(pool, nr_free);
+		shrink_pages = ttm_page_pool_free(pool, nr_free,
+						  sc->gfp_mask);
 		freed += nr_free - shrink_pages;
 	}
 	mutex_unlock(&lock);
@@ -710,7 +710,7 @@ static void ttm_put_pages(struct page **pages, unsigned npages, int flags,
 	}
 	spin_unlock_irqrestore(&pool->lock, irq_flags);
 	if (npages)
-		ttm_page_pool_free(pool, npages);
+		ttm_page_pool_free(pool, npages, GFP_KERNEL);
 }
 
 /*
@@ -850,7 +850,8 @@ void ttm_page_alloc_fini(void)
 	ttm_pool_mm_shrink_fini(_manager);
 
 	for (i = 0; i < NUM_POOLS; ++i)
-		ttm_page_pool_free(&_manager->pools[i], FREE_ALL_PAGES);
+		ttm_page_pool_free(&_manager->pools[i], FREE_ALL_PAGES,
+				   GFP_KERNEL);
 
 	kobject_put(&_manager->kobj);
 	_manager = NULL;
diff --git a/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c b/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c
index 524cc1a..ca65df1 100644
--- a/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c
+++ b/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c
@@ -411,8 +411,10 @@ static void ttm_dma_page_put(struct dma_pool *pool, struct dma_page *d_page)
  *
  * @pool: to free the pages from
  * @nr_free: If set to true will free all pages in pool
+ * @gfp: GFP flags.
  **/
-static unsigned ttm_dma_page_pool_free(struct dma_pool *pool, unsigned nr_free)
+static unsigned ttm_dma_page_pool_free(struct dma_pool *pool, unsigned nr_free,
+				       gfp_t gfp)
 {
 	unsigned long irq_flags;
 	struct dma_page *dma_p, *tmp;
@@ -430,8 +432,7 @@ static unsigned ttm_dma_page_pool_free(struct dma_pool *pool, unsigned nr_free)
 			 npages_to_free, nr_free);
 	}
 #endif
-	pages_to_free = kmalloc(npages_to_free * sizeof(struct page *),
-			GFP_KERNEL);
+	pages_to_free = kmalloc(npages_to_free * sizeof(struct page *), gfp);
 
 	if (!pages_to_free) {
 		pr_err("%s: Failed to allocate memory for pool free operation\n",
@@ -530,7 +531,7 @@ static void ttm_dma_free_pool(struct device *dev, enum pool_type type)
 		if (pool->type != type)
 			continue;
 		/* Takes a spinlock.. */
-		ttm_dma_page_pool_free(pool, FREE_ALL_PAGES);
+		ttm_dma_page_pool_free(pool, FREE_ALL_PAGES, GFP_KERNEL);
 		WARN_ON(((pool->npages_in_use + pool->npages_free) != 0));
 		/* This code path is called after _all_ references to the
 		 * struct device has been dropped - so nobody should be
@@ -983,7 +984,7 @@ void ttm_dma_unpopulate(struct ttm_dma_tt *ttm_dma, struct device *dev)
 
 	/* shrink pool if necessary (only on !is_cached pools)*/
 	if (npages)
-		ttm_dma_page_pool_free(pool, npages);
+		ttm_dma_page_pool_free(pool, npages, GFP_KERNEL);
 	ttm->state = tt_unpopulated;
 }
 EXPORT_SYMBOL_GPL(ttm_dma_unpopulate);
@@ -993,10 +994,7 @@ EXPORT_SYMBOL_GPL(ttm_dma_unpopulate);
  *
  * XXX: (dchinner) Deadlock warning!
  *
- * ttm_dma_page_pool_free() does GFP_KERNEL memory allocation, and so attention
- * needs to be paid to sc->gfp_mask to determine if this can be done or not.
- * GFP_KERNEL memory allocation in a GFP_ATOMIC reclaim context woul dbe really
- * bad.
+ * We need to pass sc->gfp_mask to ttm_dma_page_pool_free().
  *
  * I'm getting sadder as I hear more pathetical whimpers about needing per-pool
  * shrinkers
@@ -1030,7 +1028,8 @@ ttm_dma_pool_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
 		if (++idx < pool_offset)
 			continue;
 		nr_free = shrink_pages;
-		shrink_pages = ttm_dma_page_pool_free(p->pool, nr_free);
+		shrink_pages = ttm_dma_page_pool_free(p->pool, nr_free,
+						      sc->gfp_mask);
 		freed += nr_free - shrink_pages;
 
 		pr_debug("%s: (%s:%d) Asked to shrink %d, have %d more to go\n",
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2014-08-03 11:17 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-05-19 14:39 [PATCH] gpu/drm/ttm: Use mutex_lock_killable() for shrinker functions Tetsuo Handa
2014-05-20  0:40 ` Dave Airlie
2014-05-20 15:30   ` Tetsuo Handa
2014-05-24 14:22     ` Tetsuo Handa
2014-05-28 18:54       ` Konrad Rzeszutek Wilk
2014-05-28 21:47         ` Tetsuo Handa
2014-05-29 14:34           ` Tetsuo Handa
2014-05-30 16:08             ` Konrad Rzeszutek Wilk
2014-05-31  2:58               ` [PATCH 1/5] gpu/drm/ttm: Fix possible division by 0 in ttm_dma_pool_shrink_scan() Tetsuo Handa
2014-05-31  2:59                 ` [PATCH 2/5] gpu/drm/ttm: Choose a pool to shrink correctly " Tetsuo Handa
2014-05-31  3:00                   ` [PATCH 3/5] gpu/drm/ttm: Use mutex_trylock() to avoid deadlock inside shrinker functions Tetsuo Handa
2014-05-31  3:01                     ` [PATCH 4/5] gpu/drm/ttm: Fix possible stack overflow by recursive shrinker calls Tetsuo Handa
2014-05-31  3:02                       ` [PATCH 5/5] gpu/drm/ttm: Pass GFP flags in order to avoid deadlock Tetsuo Handa
2014-06-10 19:17                     ` [PATCH 3/5] gpu/drm/ttm: Use mutex_trylock() to avoid deadlock inside shrinker functions Konrad Rzeszutek Wilk
2014-06-10 20:16                       ` Tetsuo Handa
2014-08-03 11:14                         ` [PATCH 1/5] gpu/drm/ttm: Fix possible division by 0 in ttm_dma_pool_shrink_scan() Tetsuo Handa
2014-08-03 11:14                           ` [PATCH 2/5] gpu/drm/ttm: Choose a pool to shrink correctly " Tetsuo Handa
2014-08-03 11:15                             ` [PATCH 3/5] gpu/drm/ttm: Use mutex_trylock() to avoid deadlock inside shrinker functions Tetsuo Handa
2014-08-03 11:16                               ` [PATCH 4/5] gpu/drm/ttm: Fix possible stack overflow by recursive shrinker calls Tetsuo Handa
2014-08-03 11:16                                 ` [PATCH 5/5] gpu/drm/ttm: Pass GFP flags in order to avoid deadlock Tetsuo Handa
2014-05-30 16:06           ` [PATCH] gpu/drm/ttm: Use mutex_lock_killable() for shrinker functions Konrad Rzeszutek Wilk

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).