From: "Uladzislau Rezki (Sony)"
To: Andrew Morton
Cc: linux-mm@kvack.org, LKML, Uladzislau Rezki, Hillf Danton, Michal Hocko, Matthew Wilcox, Oleksiy Avramchenko, Steven Rostedt
Subject: [PATCH 2/2] mm/vmalloc: rework the drain logic
Date: Mon, 16 Nov 2020 23:00:33 +0100
Message-Id: <20201116220033.1837-2-urezki@gmail.com>
X-Mailer: git-send-email 2.20.1
In-Reply-To: <20201116220033.1837-1-urezki@gmail.com>
References: <20201116220033.1837-1-urezki@gmail.com>

The current "lazy drain" model suffers from at least two issues.

The first one is related to the unsorted list of vmap areas: identifying
the [min:max] range of areas to be drained requires a full list scan,
which is time consuming if the list is long.

The second one concerns the next step, merging all fragments with the
free space. That is also time consuming, because it has to iterate over
the entire list that holds the outstanding lazy areas.

See below the "preemptirqsoff" tracer output that illustrates the high
latency. It is ~24676 us.
Our workloads, such as audio and video, are affected by such long latency:

  tracer: preemptirqsoff

  preemptirqsoff latency trace v1.1.5 on 4.9.186-perf+
  --------------------------------------------------------------------
  latency: 24676 us, #4/4, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 P:8)
     -----------------
     | task: crtc_commit:112-261 (uid:0 nice:0 policy:1 rt_prio:16)
     -----------------
   => started at: __purge_vmap_area_lazy
   => ended at:   __purge_vmap_area_lazy

                   _------=> CPU#
                  / _-----=> irqs-off
                 | / _----=> need-resched
                 || / _---=> hardirq/softirq
                 ||| / _--=> preempt-depth
                 |||| /     delay
   cmd     pid   ||||| time  |   caller
      \   /      |||||  \    |   /
  crtc_com-261     1...1     1us*: _raw_spin_lock <-__purge_vmap_area_lazy
  [...]
  crtc_com-261     1...1 24675us : _raw_spin_unlock <-__purge_vmap_area_lazy
  crtc_com-261     1...1 24677us : trace_preempt_on <-__purge_vmap_area_lazy
  crtc_com-261     1...1 24683us :
   => free_vmap_area_noflush
   => remove_vm_area
   => __vunmap
   => vfree
   => drm_property_free_blob
   => drm_mode_object_unreference
   => drm_property_unreference_blob
   => __drm_atomic_helper_crtc_destroy_state
   => sde_crtc_destroy_state
   => drm_atomic_state_default_clear
   => drm_atomic_state_clear
   => drm_atomic_state_free
   => complete_commit
   => _msm_drm_commit_work_cb
   => kthread_worker_fn
   => kthread
   => ret_from_fork

To address those two issues we can redesign the purging of the
outstanding lazy areas. Instead of queuing vmap areas to a list, we
place them in a separate rb-tree. In that case the areas are kept in
the tree/list in ascending address order. This gives us the following
advantages:

a) Outstanding vmap areas are merged, creating bigger coalesced blocks,
   so the purge tree/list becomes less fragmented.

b) The flush range [min:max] can be calculated without scanning all
   elements. It is an O(1) operation;

c) The final merge of areas with the rb-tree that represents the free
   space is faster because of (a). As a result the lock contention is
   also reduced.

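Advantages (a) and (b) can be illustrated with a small user-space sketch.
It is not the kernel code: struct area, struct area_list, insert_merge()
and flush_range() are hypothetical names, and a linear walk stands in for
the rb-tree lookup to keep the example short. It only assumes that ranges
are half-open and never overlap.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for struct vmap_area. */
struct area {
	unsigned long start;
	unsigned long end;		/* exclusive */
	struct area *prev, *next;
};

/* Address-sorted list; first/last give the lowest/highest area. */
struct area_list {
	struct area *first;
	struct area *last;
};

/*
 * Insert 'a' so that the list stays sorted by address and merge it
 * with neighbours that touch it. Returns the node that covers the
 * inserted range afterwards. Absorbed nodes are simply dropped from
 * the list; the caller owns their memory.
 */
static struct area *insert_merge(struct area_list *l, struct area *a)
{
	struct area *prev = NULL, *next = l->first;
	int merged = 0;

	/* Linear walk for brevity (assumption: the patch uses an rb-tree). */
	while (next && next->start < a->start) {
		prev = next;
		next = next->next;
	}

	/* Merge with the following neighbour: it grows downwards. */
	if (next && next->start == a->end) {
		next->start = a->start;
		a = next;
		merged = 1;
	}

	/* Merge with the preceding neighbour: it grows upwards. */
	if (prev && prev->end == a->start) {
		prev->end = a->end;
		if (merged) {
			/* 'a' is the old next node here, unlink it. */
			prev->next = a->next;
			if (a->next)
				a->next->prev = prev;
			else
				l->last = prev;
		}
		a = prev;
		merged = 1;
	}

	if (!merged) {
		a->prev = prev;
		a->next = next;
		if (prev)
			prev->next = a;
		else
			l->first = a;
		if (next)
			next->prev = a;
		else
			l->last = a;
	}

	return a;
}

/* Widen [*start:*end] to cover the whole list in O(1), no scan needed. */
static void flush_range(const struct area_list *l,
			unsigned long *start, unsigned long *end)
{
	assert(l->first && l->last);

	if (l->first->start < *start)
		*start = l->first->start;
	if (l->last->end > *end)
		*end = l->last->end;
}
```

Because the list is sorted, the flush range only needs the first and the
last element, and every merge performed at insert time is one element
less to walk during the final merge into the free space.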
Signed-off-by: Uladzislau Rezki (Sony)
---
 include/linux/vmalloc.h |  8 ++--
 mm/vmalloc.c            | 90 +++++++++++++++++++++++------------------
 2 files changed, 53 insertions(+), 45 deletions(-)

diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 938eaf9517e2..80c0181c411d 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -72,16 +72,14 @@ struct vmap_area {
 	struct list_head list;          /* address sorted list */
 
 	/*
-	 * The following three variables can be packed, because
-	 * a vmap_area object is always one of the three states:
+	 * The following two variables can be packed, because
+	 * a vmap_area object can be either:
 	 *    1) in "free" tree (root is vmap_area_root)
-	 *    2) in "busy" tree (root is free_vmap_area_root)
-	 *    3) in purge list  (head is vmap_purge_list)
+	 *    2) or "busy" tree (root is free_vmap_area_root)
 	 */
 	union {
 		unsigned long subtree_max_size; /* in "free" tree */
 		struct vm_struct *vm;           /* in "busy" tree */
-		struct llist_node purge_list;   /* in purge list */
 	};
 };
 
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index b08b06a8cc2a..f16a71fb0624 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -413,10 +413,13 @@ static DEFINE_SPINLOCK(vmap_area_lock);
 static DEFINE_SPINLOCK(free_vmap_area_lock);
 /* Export for kexec only */
 LIST_HEAD(vmap_area_list);
-static LLIST_HEAD(vmap_purge_list);
 static struct rb_root vmap_area_root = RB_ROOT;
 static bool vmap_initialized __read_mostly;
 
+static struct rb_root purge_vmap_area_root = RB_ROOT;
+static LIST_HEAD(purge_vmap_area_list);
+static DEFINE_SPINLOCK(purge_vmap_area_lock);
+
 /*
  * This kmem_cache is used for vmap_area objects. Instead of
  * allocating from slab we reuse an object from this cache to
@@ -820,10 +823,17 @@ merge_or_add_vmap_area(struct vmap_area *va,
 	if (!merged)
 		link_va(va, root, parent, link, head);
 
-	/*
-	 * Last step is to check and update the tree.
-	 */
-	augment_tree_propagate_from(va);
+	return va;
+}
+
+static __always_inline struct vmap_area *
+merge_or_add_vmap_area_augment(struct vmap_area *va,
+	struct rb_root *root, struct list_head *head)
+{
+	va = merge_or_add_vmap_area(va, root, head);
+	if (va)
+		augment_tree_propagate_from(va);
+
 	return va;
 }
 
@@ -1138,7 +1148,7 @@ static void free_vmap_area(struct vmap_area *va)
 	 * Insert/Merge it back to the free tree/list.
 	 */
 	spin_lock(&free_vmap_area_lock);
-	merge_or_add_vmap_area(va, &free_vmap_area_root, &free_vmap_area_list);
+	merge_or_add_vmap_area_augment(va, &free_vmap_area_root, &free_vmap_area_list);
 	spin_unlock(&free_vmap_area_lock);
 }
 
@@ -1326,32 +1336,32 @@ void set_iounmap_nonlazy(void)
 static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
 {
 	unsigned long resched_threshold;
-	struct llist_node *valist;
-	struct vmap_area *va;
-	struct vmap_area *n_va;
+	struct list_head local_pure_list;
+	struct vmap_area *va, *n_va;
 
 	lockdep_assert_held(&vmap_purge_lock);
 
-	valist = llist_del_all(&vmap_purge_list);
-	if (unlikely(valist == NULL))
+	spin_lock(&purge_vmap_area_lock);
+	purge_vmap_area_root = RB_ROOT;
+	list_replace_init(&purge_vmap_area_list, &local_pure_list);
+	spin_unlock(&purge_vmap_area_lock);
+
+	if (unlikely(list_empty(&local_pure_list)))
 		return false;
 
-	/*
-	 * TODO: to calculate a flush range without looping.
-	 * The list can be up to lazy_max_pages() elements.
-	 */
-	llist_for_each_entry(va, valist, purge_list) {
-		if (va->va_start < start)
-			start = va->va_start;
-		if (va->va_end > end)
-			end = va->va_end;
-	}
+	start = min(start,
+		list_first_entry(&local_pure_list,
+			struct vmap_area, list)->va_start);
+
+	end = max(end,
+		list_last_entry(&local_pure_list,
+			struct vmap_area, list)->va_end);
 
 	flush_tlb_kernel_range(start, end);
 	resched_threshold = lazy_max_pages() << 1;
 
 	spin_lock(&free_vmap_area_lock);
-	llist_for_each_entry_safe(va, n_va, valist, purge_list) {
+	list_for_each_entry_safe(va, n_va, &local_pure_list, list) {
 		unsigned long nr = (va->va_end - va->va_start) >> PAGE_SHIFT;
 		unsigned long orig_start = va->va_start;
 		unsigned long orig_end = va->va_end;
@@ -1361,8 +1371,8 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
 		 * detached and there is no need to "unlink" it from
 		 * anything.
 		 */
-		va = merge_or_add_vmap_area(va, &free_vmap_area_root,
-					    &free_vmap_area_list);
+		va = merge_or_add_vmap_area_augment(va, &free_vmap_area_root,
+				&free_vmap_area_list);
 
 		if (!va)
 			continue;
@@ -1419,9 +1429,15 @@ static void free_vmap_area_noflush(struct vmap_area *va)
 	nr_lazy = atomic_long_add_return((va->va_end - va->va_start) >>
 				PAGE_SHIFT, &vmap_lazy_nr);
 
-	/* After this point, we may free va at any time */
-	llist_add(&va->purge_list, &vmap_purge_list);
+	/*
+	 * Merge or place it to the purge tree/list.
+	 */
+	spin_lock(&purge_vmap_area_lock);
+	merge_or_add_vmap_area(va,
+		&purge_vmap_area_root, &purge_vmap_area_list);
+	spin_unlock(&purge_vmap_area_lock);
 
+	/* After this point, we may free va at any time */
 	if (unlikely(nr_lazy > lazy_max_pages()))
 		try_purge_vmap_area_lazy();
 }
@@ -3351,8 +3367,8 @@ struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets,
 	while (area--) {
 		orig_start = vas[area]->va_start;
 		orig_end = vas[area]->va_end;
-		va = merge_or_add_vmap_area(vas[area], &free_vmap_area_root,
-					    &free_vmap_area_list);
+		va = merge_or_add_vmap_area_augment(vas[area], &free_vmap_area_root,
+				&free_vmap_area_list);
 		if (va)
 			kasan_release_vmalloc(orig_start, orig_end,
 				va->va_start, va->va_end);
@@ -3401,8 +3417,8 @@ struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets,
 	for (area = 0; area < nr_vms; area++) {
 		orig_start = vas[area]->va_start;
 		orig_end = vas[area]->va_end;
-		va = merge_or_add_vmap_area(vas[area], &free_vmap_area_root,
-					    &free_vmap_area_list);
+		va = merge_or_add_vmap_area_augment(vas[area], &free_vmap_area_root,
+				&free_vmap_area_list);
 		if (va)
 			kasan_release_vmalloc(orig_start, orig_end,
 				va->va_start, va->va_end);
@@ -3482,18 +3498,15 @@ static void show_numa_info(struct seq_file *m, struct vm_struct *v)
 
 static void show_purge_info(struct seq_file *m)
 {
-	struct llist_node *head;
 	struct vmap_area *va;
 
-	head = READ_ONCE(vmap_purge_list.first);
-	if (head == NULL)
-		return;
-
-	llist_for_each_entry(va, head, purge_list) {
+	spin_lock(&purge_vmap_area_lock);
+	list_for_each_entry(va, &purge_vmap_area_list, list) {
 		seq_printf(m, "0x%pK-0x%pK %7ld unpurged vm_area\n",
 			(void *)va->va_start, (void *)va->va_end,
 			va->va_end - va->va_start);
 	}
+	spin_unlock(&purge_vmap_area_lock);
 }
 
 static int s_show(struct seq_file *m, void *p)
@@ -3551,10 +3564,7 @@ static int s_show(struct seq_file *m, void *p)
 	seq_putc(m, '\n');
 
 	/*
-	 * As a final step, dump "unpurged" areas. Note,
-	 * that entire "/proc/vmallocinfo" output will not
-	 * be address sorted, because the purge list is not
-	 * sorted.
+	 * As a final step, dump "unpurged" areas.
 	 */
 	if (list_is_last(&va->list, &vmap_area_list))
 		show_purge_info(m);
-- 
2.20.1