Date: Fri, 19 Oct 2018 08:35:55 -0700
From: Daniel Jordan
To: Vlastimil Babka
Cc: Daniel Jordan, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
    cgroups@vger.kernel.org, aaron.lu@intel.com, ak@linux.intel.com,
    akpm@linux-foundation.org, dave.dice@oracle.com,
    dave.hansen@linux.intel.com, hannes@cmpxchg.org, levyossi@icloud.com,
    ldufour@linux.vnet.ibm.com, mgorman@techsingularity.net,
    mhocko@kernel.org, Pavel.Tatashin@microsoft.com,
    steven.sistare@oracle.com, tim.c.chen@intel.com, vdavydov.dev@gmail.com,
    ying.huang@intel.com
Subject: Re: [RFC PATCH v2 0/8] lru_lock scalability and SMP list functions
Message-ID: <20181019153555.mza7t5siubhk3ohu@ca-dmjordan1.us.oracle.com>
References: <20180911004240.4758-1-daniel.m.jordan@oracle.com>
 <2705c814-a6b8-0b14-7ea8-790325833d95@suse.cz>
In-Reply-To: <2705c814-a6b8-0b14-7ea8-790325833d95@suse.cz>

On Fri, Oct 19, 2018 at 01:35:11PM +0200, Vlastimil Babka wrote:
> On 9/11/18 2:42 AM, Daniel Jordan wrote:
> > On large systems, lru_lock can become heavily contended in memory-intensive
> > workloads such as decision support, applications that manage their memory
> > manually by allocating and freeing pages directly from the kernel, and
> > workloads with short-lived processes that force many munmap and exit
> > operations. lru_lock also inhibits scalability in many of the MM paths that
> > could be parallelized, such as freeing pages during exit/munmap and inode
> > eviction.
>
> Interesting, I would have expected isolate_lru_pages() to be the main
> culprit, as the comment says:
>
>  * For pagecache intensive workloads, this function is the hottest
>  * spot in the kernel (apart from copy_*_user functions).

Yes, I'm planning to stress reclaim to see how lru_lock responds. I've
experimented some with using dd on lots of nvme drives to keep kswapd
busy, but I'm always looking for more realistic workloads. Suggestions
welcome :)

> It also says "Some of the functions that shrink the lists perform better
> by taking out a batch of pages and working on them outside the LRU
> lock." Makes me wonder why isolate_lru_pages() also doesn't cut the list
> first instead of doing per-page list_move() (and perhaps also prefetch a
> batch of struct pages outside the lock first? Could be doable with some
> care, hopefully).

Seems like the batch prefetching and the list cutting would go hand in
hand: cutting requires walking the LRU to find where to cut, and that
walk could take a cache miss on every page list node along the way.
I'll experiment with this.
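Roughly what I have in mind (completely untested and only a sketch: it
ignores __isolate_lru_page() failures and all the accounting that
isolate_lru_pages() has to do, and the helper name is made up):

#include <linux/list.h>
#include <linux/prefetch.h>

/*
 * Walk back from the LRU tail to find the cut point, prefetching the
 * list nodes (the struct pages' lru fields) as we go, then splice the
 * whole tail run [first, last] onto 'dst' in O(1) instead of doing
 * nr_to_scan separate list_move()s under the lock.
 */
static unsigned long isolate_lru_tail_batch(struct list_head *src,
					    struct list_head *dst,
					    unsigned long nr_to_scan)
{
	struct list_head *first = src;		/* becomes start of the run */
	struct list_head *last = src->prev;	/* current LRU tail */
	unsigned long nr = 0;

	while (nr < nr_to_scan && first->prev != src) {
		first = first->prev;
		prefetch(first->prev);		/* next node we'll visit */
		nr++;
	}

	if (!nr)
		return 0;

	/* Detach [first, last] from src... */
	first->prev->next = src;
	src->prev = first->prev;

	/* ...and append it to the tail of dst. */
	first->prev = dst->prev;
	dst->prev->next = first;
	last->next = dst;
	dst->prev = last;

	return nr;
}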
> > Second, lru_lock is converted from a spinlock to a rwlock. The idea is to
> > repurpose rwlock as a two-mode lock, where callers take the lock in shared
> > (i.e. read) mode for code using the SMP list functions, and exclusive (i.e.
> > write) mode for existing code that expects exclusive access to the LRUs.
> > Multiple threads are allowed in under the read lock, of course, and they use
> > the SMP list functions to synchronize amongst themselves.
> >
> > The rwlock is scaffolding to facilitate the transition from big-hammer
> > lru_lock as it exists today to just using the list locking primitives and
> > getting rid of lru_lock entirely. Such an approach allows incremental
> > conversion of lru_lock writers until everything uses the SMP list functions
> > and takes the lock in shared mode, at which point lru_lock can just go away.
>
> Yeah I guess that will need more care, e.g. I think smp_list_del() can
> break any thread doing just a read-only traversal as it can end up with
> an entry that's been deleted and its next/prev poisoned.

As far as I can see from checking everywhere the kernel takes lru_lock,
nothing currently walks the LRUs. LRU-using code just deletes a page
from anywhere in the list, or adds one page at a time at the head or
tail, so it seems safe to use smp_list_* for all the LRU paths. This
RFC doesn't handle adding to and removing from list tails yet, but that
seems doable.

> It's a bit
> counterintuitive that "read lock" is now enough for selected modify
> operations, while read-only traversal would need a write lock.

Yes, I considered introducing wrappers to clarify this, e.g. an inline
function exclusive_lock_irqsave that just calls write_lock_irqsave, to
let people know the locks are being used specially. Would be happy to
add these in--an untested sketch of what I mean is at the bottom of
this mail.

Thanks for taking a look, Vlastimil, and for your comments!
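For reference, the wrappers would be nothing fancier than the below
(untested; exclusive_lock_irqsave is the name floated above, the
shared_* names are just placeholders, and because write_lock_irqsave()
and friends are macros that assign to 'flags', the inline versions take
a pointer):

#include <linux/spinlock.h>

/*
 * "Exclusive" is today's big-hammer lru_lock behavior; "shared" means
 * multiple threads are allowed in at once and synchronize amongst
 * themselves with the smp_list_* functions.  These are just aliases
 * for the underlying rwlock primitives.
 */
static inline void exclusive_lock_irqsave(rwlock_t *lock, unsigned long *flags)
{
	write_lock_irqsave(lock, *flags);
}

static inline void exclusive_unlock_irqrestore(rwlock_t *lock,
					       unsigned long flags)
{
	write_unlock_irqrestore(lock, flags);
}

static inline void shared_lock_irqsave(rwlock_t *lock, unsigned long *flags)
{
	read_lock_irqsave(lock, *flags);
}

static inline void shared_unlock_irqrestore(rwlock_t *lock,
					    unsigned long flags)
{
	read_unlock_irqrestore(lock, flags);
}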