From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1752710AbdCQDLN (ORCPT <rfc822;w@1wt.eu>);
        Thu, 16 Mar 2017 23:11:13 -0400
Received: from mga06.intel.com ([134.134.136.31]:61712 "EHLO mga06.intel.com"
        rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
        id S1751389AbdCQDLL (ORCPT <rfc822;linux-kernel@vger.kernel.org>);
        Thu, 16 Mar 2017 23:11:11 -0400
X-ExtLoop1: 1
X-IronPort-AV: E=Sophos;i="5.36,175,1486454400"; 
   d="scan'208";a="78047872"
Date: Fri, 17 Mar 2017 11:10:48 +0800
From: Aaron Lu <aaron.lu@intel.com>
To: Vlastimil Babka <vbabka@suse.cz>
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
        Dave Hansen <dave.hansen@intel.com>, Tim Chen <tim.c.chen@intel.com>,
        Andrew Morton <akpm@linux-foundation.org>,
        Ying Huang <ying.huang@intel.com>
Subject: Re: [PATCH v2 0/5] mm: support parallel free of memory
Message-ID: <20170317031048.GC18964@aaronlu.sh.intel.com>
References: <1489568404-7817-1-git-send-email-aaron.lu@intel.com>
 <c2e172b1-fb2a-57a0-0074-a07a61693e6c@suse.cz>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <c2e172b1-fb2a-57a0-0074-a07a61693e6c@suse.cz>
User-Agent: Mutt/1.8.0 (2017-02-23)
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, Mar 15, 2017 at 03:56:02PM +0100, Vlastimil Babka wrote:
> I wonder if the difference would be larger if the parallelism was done
> on a higher level, something around unmap_page_range(). IIUC the current

I guess I misunderstood you in my last email - doing it at
unmap_page_range() level is essentially doing it at a per-VMA level,
since it is the main function used in unmap_single_vma(). We have tried
that and felt that it's not as flexible as the proposed approach, since
it wouldn't parallelize well for:
1 workloads that use only one or very few huge VMAs;
2 workloads that have a lot of small VMAs.
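
As a rough illustration of those two cases, here is a toy user-space
model (not kernel code). It assumes one work item per VMA, a cost of one
unit per page, and a fixed per-item overhead standing in for queueing
and mmu_gather setup/teardown. The function names, MAX_WORKERS and the
cost units are all made up for this sketch:

```c
#include <assert.h>
#include <stddef.h>

#define MAX_WORKERS 64	/* hypothetical cap, only for this model */

/*
 * Toy model: every VMA becomes one work item, and each item goes to the
 * worker that is currently least loaded. An item costs one unit per page
 * plus a fixed item_overhead. Returns the load of the busiest worker,
 * i.e. the modelled wall-clock time of the per-VMA parallel free.
 */
static unsigned long per_vma_parallel_cost(const unsigned long *vma_pages,
					   size_t nr_vmas, size_t nr_workers,
					   unsigned long item_overhead)
{
	unsigned long load[MAX_WORKERS] = { 0 };
	unsigned long worst = 0;
	size_t i, w, min;

	assert(nr_workers > 0 && nr_workers <= MAX_WORKERS);

	for (i = 0; i < nr_vmas; i++) {
		/* pick the least-loaded worker for this VMA's work item */
		min = 0;
		for (w = 1; w < nr_workers; w++)
			if (load[w] < load[min])
				min = w;
		load[min] += vma_pages[i] + item_overhead;
	}

	/* the slowest worker decides when the whole free is done */
	for (w = 0; w < nr_workers; w++)
		if (load[w] > worst)
			worst = load[w];
	return worst;
}

/* Modelled cost of the serial path: one unit per page, no work items. */
static unsigned long serial_cost(const unsigned long *vma_pages,
				 size_t nr_vmas)
{
	unsigned long total = 0;
	size_t i;

	for (i = 0; i < nr_vmas; i++)
		total += vma_pages[i];
	return total;
}
```

In this model, a single huge VMA gets no speedup at all no matter how
many workers there are, because the one work item bounds the total time.
With many one-page VMAs and a per-item overhead larger than the cost of
freeing a page, the parallel cost can end up above the serial cost. The
real overhead numbers would have to be measured; the model only shows
the shape of the problem.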

The code is nice and easy though (developed in the v4.9 time frame):

>>From f6d5cfde888b9e0356719fabe8754fdfe6fe236b Mon Sep 17 00:00:00 2001
From: Aaron Lu <aaron.lu@intel.com>
Date: Wed, 11 Jan 2017 15:56:06 +0800
Subject: [PATCH] mm: async free vma

---
 include/linux/mm_types.h |  6 ++++++
 mm/memory.c              | 23 ++++++++++++++++++++++-
 2 files changed, 28 insertions(+), 1 deletion(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 4a8acedf4b7d..d10d2ce8f8f4 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -358,6 +358,12 @@ struct vm_area_struct {
 	struct mempolicy *vm_policy;	/* NUMA policy for the VMA */
 #endif
 	struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
+
+	struct vma_free_ctx {
+		unsigned long start_addr;
+		unsigned long end_addr;
+		struct work_struct work;
+	} free_ctx;
 };
 
 struct core_thread {
diff --git a/mm/memory.c b/mm/memory.c
index e18c57bdc75c..0fe4e45a044b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1345,6 +1345,17 @@ static void unmap_single_vma(struct mmu_gather *tlb,
 	}
 }
 
+static void unmap_single_vma_work(struct work_struct *work)
+{
+	struct vma_free_ctx *ctx = container_of(work, struct vma_free_ctx, work);
+	struct vm_area_struct *vma = container_of(ctx, struct vm_area_struct, free_ctx);
+	struct mmu_gather tlb;
+
+	tlb_gather_mmu(&tlb, vma->vm_mm, ctx->start_addr, ctx->end_addr);
+	unmap_single_vma(&tlb, vma, ctx->start_addr, ctx->end_addr, NULL);
+	tlb_finish_mmu(&tlb, ctx->start_addr, ctx->end_addr);
+}
+
 /**
  * unmap_vmas - unmap a range of memory covered by a list of vma's
  * @tlb: address of the caller's struct mmu_gather
@@ -1368,10 +1379,20 @@ void unmap_vmas(struct mmu_gather *tlb,
 		unsigned long end_addr)
 {
 	struct mm_struct *mm = vma->vm_mm;
+	struct vma_free_ctx *ctx;
+	struct vm_area_struct *tmp = vma;
 
 	mmu_notifier_invalidate_range_start(mm, start_addr, end_addr);
+	for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next) {
+		ctx = &vma->free_ctx;
+		ctx->start_addr = start_addr;
+		ctx->end_addr = end_addr;
+		INIT_WORK(&ctx->work, unmap_single_vma_work);
+		queue_work(system_unbound_wq, &ctx->work);
+	}
+	vma = tmp;
 	for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next)
-		unmap_single_vma(tlb, vma, start_addr, end_addr, NULL);
+		flush_work(&vma->free_ctx.work);
 	mmu_notifier_invalidate_range_end(mm, start_addr, end_addr);
 }
 
-- 
2.9.3
