From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1753971AbeCFVRr (ORCPT ); Tue, 6 Mar 2018 16:17:47 -0500
Received: from out30-132.freemail.mail.aliyun.com ([115.124.30.132]:39630
        "EHLO out30-132.freemail.mail.aliyun.com" rhost-flags-OK-OK-OK-OK)
        by vger.kernel.org with ESMTP id S1753598AbeCFVRq (ORCPT );
        Tue, 6 Mar 2018 16:17:46 -0500
X-Alimail-AntiSpam: AC=PASS;BC=-1|-1;BR=01201311R211e4;CH=green;FP=0|-1|-1|-1|0|-1|-1|-1;HT=e01f04446;MF=yang.shi@linux.alibaba.com;NM=1;PH=DS;RN=6;SR=0;TI=SMTPD_---0Sz.Z2.j_1520371058;
Subject: Re: [RFC PATCH 0/4 v2] Define killable version for access_remote_vm() and use it in fs/proc
To: Andrew Morton
Cc: mingo@kernel.org, adobriyan@gmail.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org, David Rientjes
References: <1519691151-101999-1-git-send-email-yang.shi@linux.alibaba.com> <20180306124540.d8b5f6da97ab69a49566f950@linux-foundation.org>
From: Yang Shi
Message-ID:
Date: Tue, 6 Mar 2018 13:17:37 -0800
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:52.0) Gecko/20100101 Thunderbird/52.2.1
MIME-Version: 1.0
In-Reply-To: <20180306124540.d8b5f6da97ab69a49566f950@linux-foundation.org>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Content-Language: en-US
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On 3/6/18 12:45 PM, Andrew Morton wrote:
> On Tue, 27 Feb 2018 08:25:47 +0800 Yang Shi wrote:
>
>> Background:
>> When running vm-scalability with large memory (> 300GB), the below hung
>> task issue happens occasionally.
>>
>> INFO: task ps:14018 blocked for more than 120 seconds.
>> Tainted: G E 4.9.79-009.ali3000.alios7.x86_64 #1
>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> ps D 0 14018 1 0x00000004
>> ffff885582f84000 ffff885e8682f000 ffff880972943000 ffff885ebf499bc0
>> ffff8828ee120000 ffffc900349bfca8 ffffffff817154d0 0000000000000040
>> 00ffffff812f872a ffff885ebf499bc0 024000d000948300 ffff880972943000
>> Call Trace:
>> [] ? __schedule+0x250/0x730
>> [] schedule+0x36/0x80
>> [] rwsem_down_read_failed+0xf0/0x150
>> [] call_rwsem_down_read_failed+0x18/0x30
>> [] down_read+0x20/0x40
>> [] proc_pid_cmdline_read+0xd9/0x4e0
>> [] ? do_filp_open+0xa5/0x100
>> [] __vfs_read+0x37/0x150
>> [] ? security_file_permission+0x9b/0xc0
>> [] vfs_read+0x96/0x130
>> [] SyS_read+0x55/0xc0
>> [] entry_SYSCALL_64_fastpath+0x1a/0xc5
>>
>> When manipulating a large mapping, the process may hold the mmap_sem for
>> long time, so reading /proc//cmdline may be blocked in
>> uninterruptible state for long time.
>> We already have killable version APIs for semaphore, here use down_read_killable()
>> to improve the responsiveness.
>>
> Maybe I'm missing something, but I don't see how this solves the
> problem. Yes, the read of /proc/pid/cmdline will be abandoned if
> someone interrupts that process. But if nobody does that, the read
> will still just sit there for 2 minutes and the watchdog warning will
> still come out?

No, the warning will not come out, because down_read_killable() puts the
task into TASK_KILLABLE state instead of TASK_UNINTERRUPTIBLE state. The
hung task check skips TASK_KILLABLE tasks; see the code below from
kernel/hung_task.c:

        /* use "==" to skip the TASK_KILLABLE tasks waiting on NFS */
        if (t->state == TASK_UNINTERRUPTIBLE)
                check_hung_task(t, timeout);

This only mitigates the hung task warning; it cannot resolve the
mmap_sem scalability issue. Furthermore, waiting in a purely
uninterruptible state just to read /proc seems unnecessary, since the
read is not waiting for I/O completion.

>
> Where the heck are we holding mmap_sem for so long? Can that be fixed?

The mmap_sem is held while unmapping a large mapping in which every
single page is mapped.
This is not an issue in real production code. I just found it by running
vm-scalability on a machine with ~600GB memory.

AFAIK, there is no easy fix for the mmap_sem scalability issue. I saw
range locking patches (https://lwn.net/Articles/723648/) floating
around, but they may not help much in the case of a large mapping with
every single page mapped.

Thanks,
Yang
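
For readers following the thread, here is a minimal userspace sketch of
why the "==" comparison quoted from kernel/hung_task.c skips killable
sleepers. It is not kernel code. The helper names
(hung_task_would_check, mask_test_would_check, self_check) are
hypothetical, and the numeric state values are written as recalled from
the 4.9-era include/linux/sched.h, so treat them as illustrative; they
differ in later kernels. The only property the sketch relies on is that
TASK_KILLABLE is TASK_WAKEKILL OR'ed with TASK_UNINTERRUPTIBLE.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Task state bits. Assumption: values as recalled from a 4.9-era
 * include/linux/sched.h; they are illustrative only.
 */
#define TASK_RUNNING            0
#define TASK_INTERRUPTIBLE      1
#define TASK_UNINTERRUPTIBLE    2
#define TASK_WAKEKILL           128
#define TASK_KILLABLE           (TASK_WAKEKILL | TASK_UNINTERRUPTIBLE)

/* Hypothetical helper: mirrors the "==" test quoted from hung_task.c. */
bool hung_task_would_check(long state)
{
        return state == TASK_UNINTERRUPTIBLE;
}

/* Hypothetical helper: a "&" test, shown only for contrast. */
bool mask_test_would_check(long state)
{
        return (state & TASK_UNINTERRUPTIBLE) != 0;
}

/* Hypothetical self-check exercising both helpers. */
void self_check(void)
{
        /* A plain uninterruptible sleeper is examined by the watchdog. */
        assert(hung_task_would_check(TASK_UNINTERRUPTIBLE));
        /* A killable sleeper carries an extra bit, so "==" skips it. */
        assert(!hung_task_would_check(TASK_KILLABLE));
        /* A mask test would not skip it, hence the comment about "==". */
        assert(mask_test_would_check(TASK_KILLABLE));
}
```

Calling self_check() from a small main() should return without tripping
any assertion. A task sleeping in down_read_killable() falls into the
second case, so the watchdog never examines it.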