From: josef@toxicpanda.com
Subject: [PATCH 0/2][V3] slab and general reclaim improvements
Date: Thu, 20 Jul 2017 14:45:29 -0400
Message-Id: <1500576331-31214-1-git-send-email-jbacik@fb.com>
To: minchan@kernel.org, linux-mm@kvack.org, hannes@cmpxchg.org, riel@redhat.com, akpm@linux-foundation.org, david@fromorbit.com, kernel-team@fb.com

This is a new set of patches to address some slab reclaim issues I observed when trying to convert btrfs over to a purely slab-based metadata system. The problem is that our slab reclaim amounts are based on how page cache scanning goes, which is not related to how much slab is in use on the system. This means we waste a lot of cycles trying to reclaim very small numbers of objects and thus take a huge latency hit when we come under memory pressure. The second patch fixes another problem I noticed, namely that we will constantly evict active pages instead of trying to evict shorter-lived pages and objects.

You can run the following test to see the behavior:

https://github.com/josefbacik/debug-scripts/blob/master/cache-pressure/cache-pressure.sh

It reads two files that fit nicely into RAM and then adds slab pressure from fs_mark; you'll see that we evict both files 3 or 4 times during the run. With my patches we don't evict any of the file pages at all, and only the slab caches see reclaim pressure.

CHANGES SINCE V2: After discussions with Minchan and Dave I went back to the drawing board. Dave's suggestion of shrinker-specific callbacks to allow shrinkers to reclaim at their own rate was intriguing, so I tried that first. Unfortunately it fell apart because I couldn't figure out a scheme that worked without needlessly inducing latency during 'stable' periods. Without a view of the whole system it's hard to know when to trigger scanning of active objects without wasting CPU cycles that aren't actually needed; we would end up starting our background thread and scanning objects when in reality our workload fits perfectly fine in memory.

Minchan's main complaint about my previous approach was that it was too aggressive, and that any attempt to short-circuit the aggression would be unfair to other shrinkers. So instead I use his idea of basing our scan count on sc->priority. But rather than the traditional ratio, just use scan = nr_objects >> sc->priority. This matches what we do with the LRUs, where scan = pages >> sc->priority, and gives us an appropriate amount of aggressiveness for slab reclaim.

I've dropped the other two patches around stopping slab reclaim, and have changed the slab pressure to be based on sc->priority, which is consistent with every other LRU on the system and gives me the same results in my test cases.
Thanks,

Josef

From: josef@toxicpanda.com
Subject: [PATCH 1/2] mm: use sc->priority for slab shrink targets
Date: Thu, 20 Jul 2017 14:45:30 -0400
Message-Id: <1500576331-31214-2-git-send-email-jbacik@fb.com>
In-Reply-To: <1500576331-31214-1-git-send-email-jbacik@fb.com>
References: <1500576331-31214-1-git-send-email-jbacik@fb.com>
To: minchan@kernel.org, linux-mm@kvack.org, hannes@cmpxchg.org, riel@redhat.com, akpm@linux-foundation.org, david@fromorbit.com, kernel-team@fb.com
Cc: Josef Bacik

From: Josef Bacik

Previously we were using the ratio of the number of lru pages scanned to the number of eligible lru pages to determine the number of slab objects to scan. The problem with this is that these two things have nothing to do with each other, so in slab-heavy workloads where there is little to no page cache we can end up with the pages-scanned count being a very low number. This means that we reclaim next to no slab pages and waste a lot of time reclaiming small amounts of space.

Instead use sc->priority in the same way we use it to determine scan amounts for the LRUs. This generally equates to pages. Consider the following:

  slab_pages = (nr_objects * object_size) / PAGE_SIZE

What we would like to do is

  scan = slab_pages >> sc->priority

but we don't know the number of slab pages each shrinker controls, only the objects. However, suppose we theoretically knew how many pages a shrinker controlled; we'd still have to convert this to objects, which would look like the following:

  scan = shrinker_pages >> sc->priority
  scan_objects = (PAGE_SIZE / object_size) * scan

or written another way

  scan_objects = (shrinker_pages >> sc->priority) * (PAGE_SIZE / object_size)

which can thus be written

  scan_objects = ((shrinker_pages * PAGE_SIZE) / object_size) >> sc->priority

which is just

  scan_objects = nr_objects >> sc->priority

We don't need to know exactly how many pages each shrinker represents; its objects are all the information we need. Making this change allows us to place an appropriate amount of pressure on the shrinker pools relative to their size.
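As a rough illustration of how the two schemes behave on a slab-heavy box, here is a small userspace sketch (not part of the patch). The numbers are invented for the example, loosely following the multi-million-inode case discussed later in this thread, and seeks assumes DEFAULT_SEEKS:

#include <stdio.h>

int main(void)
{
	unsigned long freeable = 10000000;	/* ~10 million cached objects, e.g. inodes */
	unsigned long nr_scanned = 2;		/* almost no page cache was scanned */
	unsigned long nr_eligible = 26000;	/* small LRUs on a slab-heavy system */
	unsigned long seeks = 2;		/* DEFAULT_SEEKS */
	unsigned long old_delta;
	int priority;

	/* old scheme: scan target follows the LRU scan ratio, so it stays tiny */
	old_delta = (4 * nr_scanned) / seeks;
	old_delta = old_delta * freeable / (nr_eligible + 1);
	printf("old delta: %lu objects\n", old_delta);

	/* new scheme: scan target follows the pool size and ramps up with priority */
	for (priority = 12; priority >= 4; priority -= 4)
		printf("priority %2d: new delta %lu objects\n",
		       priority, freeable >> priority);
	return 0;
}

The point is not the absolute numbers at DEF_PRIORITY but that the new target scales with the size of the cache and grows as the reclaim priority drops, the same way the LRU scan targets do, instead of staying pinned to the tiny page cache scan counts.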
Signed-off-by: Josef Bacik --- include/trace/events/vmscan.h | 23 ++++++++++------------ mm/vmscan.c | 46 +++++++++++-------------------------------- 2 files changed, 22 insertions(+), 47 deletions(-) diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h index 27e8a5c..8c5a00a 100644 --- a/include/trace/events/vmscan.h +++ b/include/trace/events/vmscan.h @@ -187,12 +187,12 @@ DEFINE_EVENT(mm_vmscan_direct_reclaim_end_template, mm_vmscan_memcg_softlimit_re TRACE_EVENT(mm_shrink_slab_start, TP_PROTO(struct shrinker *shr, struct shrink_control *sc, - long nr_objects_to_shrink, unsigned long pgs_scanned, - unsigned long lru_pgs, unsigned long cache_items, - unsigned long long delta, unsigned long total_scan), + long nr_objects_to_shrink, unsigned long cache_items, + unsigned long long delta, unsigned long total_scan, + int priority), - TP_ARGS(shr, sc, nr_objects_to_shrink, pgs_scanned, lru_pgs, - cache_items, delta, total_scan), + TP_ARGS(shr, sc, nr_objects_to_shrink, cache_items, delta, total_scan, + priority), TP_STRUCT__entry( __field(struct shrinker *, shr) @@ -200,11 +200,10 @@ TRACE_EVENT(mm_shrink_slab_start, __field(int, nid) __field(long, nr_objects_to_shrink) __field(gfp_t, gfp_flags) - __field(unsigned long, pgs_scanned) - __field(unsigned long, lru_pgs) __field(unsigned long, cache_items) __field(unsigned long long, delta) __field(unsigned long, total_scan) + __field(int, priority) ), TP_fast_assign( @@ -213,24 +212,22 @@ TRACE_EVENT(mm_shrink_slab_start, __entry->nid = sc->nid; __entry->nr_objects_to_shrink = nr_objects_to_shrink; __entry->gfp_flags = sc->gfp_mask; - __entry->pgs_scanned = pgs_scanned; - __entry->lru_pgs = lru_pgs; __entry->cache_items = cache_items; __entry->delta = delta; __entry->total_scan = total_scan; + __entry->priority = priority; ), - TP_printk("%pF %p: nid: %d objects to shrink %ld gfp_flags %s pgs_scanned %ld lru_pgs %ld cache items %ld delta %lld total_scan %ld", + TP_printk("%pF %p: nid: %d objects to shrink %ld gfp_flags %s cache items %ld delta %lld total_scan %ld priority %d", __entry->shrink, __entry->shr, __entry->nid, __entry->nr_objects_to_shrink, show_gfp_flags(__entry->gfp_flags), - __entry->pgs_scanned, - __entry->lru_pgs, __entry->cache_items, __entry->delta, - __entry->total_scan) + __entry->total_scan, + __entry->priority) ); TRACE_EVENT(mm_shrink_slab_end, diff --git a/mm/vmscan.c b/mm/vmscan.c index f84cdd3..a6f33ef 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -306,9 +306,7 @@ EXPORT_SYMBOL(unregister_shrinker); #define SHRINK_BATCH 128 static unsigned long do_shrink_slab(struct shrink_control *shrinkctl, - struct shrinker *shrinker, - unsigned long nr_scanned, - unsigned long nr_eligible) + struct shrinker *shrinker, int priority) { unsigned long freed = 0; unsigned long long delta; @@ -333,9 +331,8 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl, nr = atomic_long_xchg(&shrinker->nr_deferred[nid], 0); total_scan = nr; - delta = (4 * nr_scanned) / shrinker->seeks; - delta *= freeable; - do_div(delta, nr_eligible + 1); + delta = freeable >> priority; + delta = (4 * freeable) / shrinker->seeks; total_scan += delta; if (total_scan < 0) { pr_err("shrink_slab: %pF negative objects to delete nr=%ld\n", @@ -369,8 +366,7 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl, total_scan = freeable * 2; trace_mm_shrink_slab_start(shrinker, shrinkctl, nr, - nr_scanned, nr_eligible, - freeable, delta, total_scan); + freeable, delta, total_scan, priority); /* * Normally, we should not 
scan less than batch_size objects in one @@ -429,8 +425,7 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl, * @gfp_mask: allocation context * @nid: node whose slab caches to target * @memcg: memory cgroup whose slab caches to target - * @nr_scanned: pressure numerator - * @nr_eligible: pressure denominator + * @priority: the reclaim priority * * Call the shrink functions to age shrinkable caches. * @@ -442,20 +437,14 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl, * objects from the memory cgroup specified. Otherwise, only unaware * shrinkers are called. * - * @nr_scanned and @nr_eligible form a ratio that indicate how much of - * the available objects should be scanned. Page reclaim for example - * passes the number of pages scanned and the number of pages on the - * LRU lists that it considered on @nid, plus a bias in @nr_scanned - * when it encountered mapped pages. The ratio is further biased by - * the ->seeks setting of the shrink function, which indicates the - * cost to recreate an object relative to that of an LRU page. + * @priority is sc->priority, we take the number of objects and >> by priority + * in order to get the scan target. * * Returns the number of reclaimed slab objects. */ static unsigned long shrink_slab(gfp_t gfp_mask, int nid, struct mem_cgroup *memcg, - unsigned long nr_scanned, - unsigned long nr_eligible) + int priority) { struct shrinker *shrinker; unsigned long freed = 0; @@ -463,9 +452,6 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid, if (memcg && (!memcg_kmem_enabled() || !mem_cgroup_online(memcg))) return 0; - if (nr_scanned == 0) - nr_scanned = SWAP_CLUSTER_MAX; - if (!down_read_trylock(&shrinker_rwsem)) { /* * If we would return 0, our callers would understand that we @@ -496,7 +482,7 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid, if (!(shrinker->flags & SHRINKER_NUMA_AWARE)) sc.nid = 0; - freed += do_shrink_slab(&sc, shrinker, nr_scanned, nr_eligible); + freed += do_shrink_slab(&sc, shrinker, priority); } up_read(&shrinker_rwsem); @@ -514,8 +500,7 @@ void drop_slab_node(int nid) freed = 0; do { - freed += shrink_slab(GFP_KERNEL, nid, memcg, - 1000, 1000); + freed += shrink_slab(GFP_KERNEL, nid, memcg, 0); } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL); } while (freed > 10); } @@ -2592,14 +2577,12 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc) reclaimed = sc->nr_reclaimed; scanned = sc->nr_scanned; - shrink_node_memcg(pgdat, memcg, sc, &lru_pages); node_lru_pages += lru_pages; if (memcg) shrink_slab(sc->gfp_mask, pgdat->node_id, - memcg, sc->nr_scanned - scanned, - lru_pages); + memcg, sc->priority); /* Record the group's reclaim efficiency */ vmpressure(sc->gfp_mask, memcg, false, @@ -2623,14 +2606,9 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc) } } while ((memcg = mem_cgroup_iter(root, memcg, &reclaim))); - /* - * Shrink the slab caches in the same proportion that - * the eligible LRU pages were scanned. - */ if (global_reclaim(sc)) shrink_slab(sc->gfp_mask, pgdat->node_id, NULL, - sc->nr_scanned - nr_scanned, - node_lru_pages); + sc->priority); /* * Record the subtree's reclaim efficiency. The reclaimed -- 2.7.4 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
From: josef@toxicpanda.com
Subject: [PATCH 2/2] mm: make kswapd try harder to keep active pages in cache
Date: Thu, 20 Jul 2017 14:45:31 -0400
Message-Id: <1500576331-31214-3-git-send-email-jbacik@fb.com>
In-Reply-To: <1500576331-31214-1-git-send-email-jbacik@fb.com>
References: <1500576331-31214-1-git-send-email-jbacik@fb.com>
To: minchan@kernel.org, linux-mm@kvack.org, hannes@cmpxchg.org, riel@redhat.com, akpm@linux-foundation.org, david@fromorbit.com, kernel-team@fb.com
Cc: Josef Bacik

From: Josef Bacik

While testing slab reclaim I noticed that if we are running a workload that uses most of the system memory for its working set and we start putting a lot of reclaimable slab pressure on the system (think find /, or some other silliness), we will happily evict the active pages over the slab cache. This is backwards, as we want to do all that we can to keep the active working set in memory and instead evict these short-lived objects. The same thing occurs when, say, you do a yum update of a few packages while your working set takes up most of RAM: you end up with relatively small inactive lists, and so we reclaim active pages even though we could reclaim the short-lived inactive pages.

My approach here is twofold. First, keep track of the difference in inactive and slab pages since the last time kswapd ran. On the first run this will just be the overall counts of inactive and slab, but for each subsequent run we'll have a good idea of where the memory pressure is coming from. Then we use this information to put pressure on either the inactive lists or the slab caches, depending on where the pressure is coming from.

I have two tests I was using to watch either side of this problem. The first test kept 2 files that took up 3/4 of the memory, and then started creating a bunch of empty files. Without this patch we would have to re-read both files in their entirety at least 3 times during the run. With this patch the active pages are never evicted. The second test read and stat'ed all the files in a directory, which again took up about 3/4 of the memory with slab cache. Then I cat'ed a 100GiB file into /dev/null and verified that none of the file pages were evicted.

Signed-off-by: Josef Bacik --- mm/vmscan.c | 172 +++++++++++++++++++++++++++++++++++++++++++++++++++++++----- 1 file changed, 158 insertions(+), 14 deletions(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index a6f33ef..aa73368 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -110,11 +110,20 @@ struct scan_control { /* One of the zones is ready for compaction */ unsigned int compaction_ready:1; + /* Only reclaim inactive page cache or slab.
*/ + unsigned int inactive_only:1; + /* Incremented by the number of inactive pages that were scanned */ unsigned long nr_scanned; /* Number of pages freed so far during a call to shrink_zones() */ unsigned long nr_reclaimed; + + /* Number of inactive pages added since last kswapd run. */ + unsigned long inactive_diff; + + /* Number of slab pages added since last kswapd run. */ + unsigned long slab_diff; }; #ifdef ARCH_HAS_PREFETCH @@ -306,7 +315,8 @@ EXPORT_SYMBOL(unregister_shrinker); #define SHRINK_BATCH 128 static unsigned long do_shrink_slab(struct shrink_control *shrinkctl, - struct shrinker *shrinker, int priority) + struct shrinker *shrinker, int priority, + unsigned long *slab_scanned) { unsigned long freed = 0; unsigned long long delta; @@ -405,6 +415,9 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl, next_deferred -= scanned; else next_deferred = 0; + if (slab_scanned) + (*slab_scanned) += scanned; + /* * move the unused scan count back into the shrinker in a * manner that handles concurrent updates. If we exhausted the @@ -444,7 +457,7 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl, */ static unsigned long shrink_slab(gfp_t gfp_mask, int nid, struct mem_cgroup *memcg, - int priority) + int priority, unsigned long *slab_scanned) { struct shrinker *shrinker; unsigned long freed = 0; @@ -482,7 +495,7 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid, if (!(shrinker->flags & SHRINKER_NUMA_AWARE)) sc.nid = 0; - freed += do_shrink_slab(&sc, shrinker, priority); + freed += do_shrink_slab(&sc, shrinker, priority, slab_scanned); } up_read(&shrinker_rwsem); @@ -500,7 +513,7 @@ void drop_slab_node(int nid) freed = 0; do { - freed += shrink_slab(GFP_KERNEL, nid, memcg, 0); + freed += shrink_slab(GFP_KERNEL, nid, memcg, 0, NULL); } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL); } while (freed > 10); } @@ -2131,6 +2144,7 @@ enum scan_balance { SCAN_FRACT, SCAN_ANON, SCAN_FILE, + SCAN_INACTIVE, }; /* @@ -2157,6 +2171,11 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg, unsigned long ap, fp; enum lru_list lru; + if (sc->inactive_only) { + scan_balance = SCAN_INACTIVE; + goto out; + } + /* If we have no swap space, do not bother scanning anon pages. 
*/ if (!sc->may_swap || mem_cgroup_get_nr_swap_pages(memcg) <= 0) { scan_balance = SCAN_FILE; @@ -2330,6 +2349,14 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg, scan = 0; } break; + case SCAN_INACTIVE: + if (file && !is_active_lru(lru)) { + scan = max(scan, sc->nr_to_reclaim); + } else { + size = 0; + scan = 0; + } + break; default: /* Look ma, no brain */ BUG(); @@ -2547,7 +2574,61 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc) { struct reclaim_state *reclaim_state = current->reclaim_state; unsigned long nr_reclaimed, nr_scanned; + unsigned long greclaim = 1, gslab = 1, total_high_wmark = 0, nr_inactive; + int priority_adj = 1; bool reclaimable = false; + bool skip_slab = false; + + if (global_reclaim(sc)) { + int z; + for (z = 0; z < MAX_NR_ZONES; z++) { + struct zone *zone = &pgdat->node_zones[z]; + if (!managed_zone(zone)) + continue; + total_high_wmark += high_wmark_pages(zone); + } + nr_inactive = node_page_state(pgdat, NR_INACTIVE_FILE); + gslab = node_page_state(pgdat, NR_SLAB_RECLAIMABLE); + greclaim = pgdat_reclaimable_pages(pgdat); + } else { + struct lruvec *lruvec = + mem_cgroup_lruvec(pgdat, sc->target_mem_cgroup); + total_high_wmark = sc->nr_to_reclaim; + nr_inactive = lruvec_page_state(lruvec, NR_INACTIVE_FILE); + gslab = lruvec_page_state(lruvec, NR_SLAB_RECLAIMABLE); + } + + /* + * If we don't have a lot of inactive or slab pages then there's no + * point in trying to free them exclusively, do the normal scan stuff. + */ + if (nr_inactive + gslab < total_high_wmark) + sc->inactive_only = 0; + + /* + * We still want to slightly prefer slab over inactive, so if the + * inactive on this node is large enough and what is pushing us into + * reclaim territory then limit our flushing to the inactive list for + * the first go around. + * + * The idea is that with a memcg configured system we will still reclaim + * memcg aware shrinkers, which includes the super block shrinkers. So + * if our steady state is keeping fs objects in cache for our workload + * we'll still put a certain amount of pressure on them anyway. To + * avoid evicting things we actually care about we want to skip slab + * reclaim altogether. + * + * However we still want to account for slab and inactive growing at the + * same rate, so if that is the case just carry on shrinking inactive + * and slab together. 
+ */ + if (nr_inactive > total_high_wmark && + sc->inactive_diff > sc->slab_diff) { + unsigned long tmp = sc->inactive_diff >> 1; + + if (tmp >= sc->slab_diff) + skip_slab = true; + } do { struct mem_cgroup *root = sc->target_mem_cgroup; @@ -2556,6 +2637,8 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc) .priority = sc->priority, }; unsigned long node_lru_pages = 0; + unsigned long slab_reclaimed = 0; + unsigned long slab_scanned = 0; struct mem_cgroup *memcg; nr_reclaimed = sc->nr_reclaimed; @@ -2580,9 +2663,16 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc) shrink_node_memcg(pgdat, memcg, sc, &lru_pages); node_lru_pages += lru_pages; - if (memcg) - shrink_slab(sc->gfp_mask, pgdat->node_id, - memcg, sc->priority); + if (memcg && !skip_slab) { + int priority = sc->priority; + if (sc->inactive_only) + priority -= priority_adj; + priority = max(0, priority); + slab_reclaimed += + shrink_slab(sc->gfp_mask, + pgdat->node_id, memcg, + priority, &slab_scanned); + } /* Record the group's reclaim efficiency */ vmpressure(sc->gfp_mask, memcg, false, @@ -2606,9 +2696,16 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc) } } while ((memcg = mem_cgroup_iter(root, memcg, &reclaim))); - if (global_reclaim(sc)) - shrink_slab(sc->gfp_mask, pgdat->node_id, NULL, - sc->priority); + if (!skip_slab && global_reclaim(sc)) { + int priority = sc->priority; + if (sc->inactive_only) + priority -= priority_adj; + priority = max(0, priority); + slab_reclaimed += + shrink_slab(sc->gfp_mask, pgdat->node_id, NULL, + priority, &slab_scanned); + } + /* * Record the subtree's reclaim efficiency. The reclaimed @@ -2627,9 +2724,28 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc) reclaim_state->reclaimed_slab = 0; } - if (sc->nr_reclaimed - nr_reclaimed) + if (sc->nr_reclaimed - nr_reclaimed) { reclaimable = true; + } else if (sc->inactive_only && !skip_slab) { + unsigned long percent = 0; + /* + * We didn't reclaim anything this go around, so the + * inactive list is likely spent. If we're reclaiming + * less than half of the objects in slab that we're + * scanning then just stop doing the inactive only scan. + * Otherwise ramp up the pressure on the slab caches + * hoping that eventually we'll start freeing enough + * objects to reclaim space. + */ + if (slab_scanned) + percent = slab_reclaimed * 100 / slab_scanned; + if (percent < 50) + sc->inactive_only = 0; + else + priority_adj++; + } + skip_slab = false; } while (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed, sc->nr_scanned - nr_scanned, sc)); @@ -3272,7 +3388,8 @@ static bool kswapd_shrink_node(pg_data_t *pgdat, * or lower is eligible for reclaim until at least one usable zone is * balanced. 
*/ -static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx) +static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx, + unsigned long inactive_diff, unsigned long slab_diff) { int i; unsigned long nr_soft_reclaimed; @@ -3285,6 +3402,9 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx) .may_writepage = !laptop_mode, .may_unmap = 1, .may_swap = 1, + .inactive_only = 1, + .inactive_diff = inactive_diff, + .slab_diff = slab_diff, }; count_vm_event(PAGEOUTRUN); @@ -3504,7 +3624,7 @@ static int kswapd(void *p) unsigned int classzone_idx = MAX_NR_ZONES - 1; pg_data_t *pgdat = (pg_data_t*)p; struct task_struct *tsk = current; - + unsigned long nr_slab = 0, nr_inactive = 0; struct reclaim_state reclaim_state = { .reclaimed_slab = 0, }; @@ -3534,6 +3654,7 @@ static int kswapd(void *p) pgdat->kswapd_order = 0; pgdat->kswapd_classzone_idx = MAX_NR_ZONES; for ( ; ; ) { + unsigned long slab_diff, inactive_diff; bool ret; alloc_order = reclaim_order = pgdat->kswapd_order; @@ -3561,6 +3682,23 @@ static int kswapd(void *p) continue; /* + * We want to know where we're adding pages so we can make + * smarter decisions about where we're going to put pressure + * when shrinking. + */ + slab_diff = sum_zone_node_page_state(pgdat->node_id, + NR_SLAB_RECLAIMABLE); + inactive_diff = node_page_state(pgdat, NR_INACTIVE_FILE); + if (nr_slab > slab_diff) + slab_diff = 0; + else + slab_diff -= nr_slab; + if (inactive_diff < nr_inactive) + inactive_diff = 0; + else + inactive_diff -= nr_inactive; + + /* * Reclaim begins at the requested order but if a high-order * reclaim fails then kswapd falls back to reclaiming for * order-0. If that happens, kswapd will consider sleeping @@ -3570,7 +3708,11 @@ static int kswapd(void *p) */ trace_mm_vmscan_kswapd_wake(pgdat->node_id, classzone_idx, alloc_order); - reclaim_order = balance_pgdat(pgdat, alloc_order, classzone_idx); + reclaim_order = balance_pgdat(pgdat, alloc_order, classzone_idx, + inactive_diff, slab_diff); + nr_inactive = node_page_state(pgdat, NR_INACTIVE_FILE); + nr_slab = sum_zone_node_page_state(pgdat->node_id, + NR_SLAB_RECLAIMABLE); if (reclaim_order < alloc_order) goto kswapd_try_sleep; } @@ -3822,6 +3964,8 @@ static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned in .may_unmap = !!(node_reclaim_mode & RECLAIM_UNMAP), .may_swap = 1, .reclaim_idx = gfp_zone(gfp_mask), + .slab_diff = 1, + .inactive_diff = 1, }; cond_resched(); -- 2.7.4 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wr0-f197.google.com (mail-wr0-f197.google.com [209.85.128.197]) by kanga.kvack.org (Postfix) with ESMTP id C303F6B04AC for ; Thu, 27 Jul 2017 19:55:44 -0400 (EDT) Received: by mail-wr0-f197.google.com with SMTP id u89so36827459wrc.1 for ; Thu, 27 Jul 2017 16:55:44 -0700 (PDT) Received: from mail.linuxfoundation.org (mail.linuxfoundation.org. 
[140.211.169.12]) by mx.google.com with ESMTPS id 123si11778778wmk.3.2017.07.27.16.55.43 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Thu, 27 Jul 2017 16:55:43 -0700 (PDT) Date: Thu, 27 Jul 2017 16:55:41 -0700 From: Andrew Morton Subject: Re: [PATCH 0/2][V3] slab and general reclaim improvements Message-Id: <20170727165541.b9246aeb333227d7b36b5e8b@linux-foundation.org> In-Reply-To: <1500576331-31214-1-git-send-email-jbacik@fb.com> References: <1500576331-31214-1-git-send-email-jbacik@fb.com> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: josef@toxicpanda.com Cc: minchan@kernel.org, linux-mm@kvack.org, hannes@cmpxchg.org, riel@redhat.com, david@fromorbit.com, kernel-team@fb.com On Thu, 20 Jul 2017 14:45:29 -0400 josef@toxicpanda.com wrote: > This is a new set of patches to address some slab reclaim issues I observed when > trying to convert btrfs over to a purely slab meta data system. Without more review-n-test I'm a bit reluctant to apply these even on a see-how-it-goes basis. Anyone? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wr0-f198.google.com (mail-wr0-f198.google.com [209.85.128.198]) by kanga.kvack.org (Postfix) with ESMTP id 0187A6B04B7 for ; Thu, 27 Jul 2017 20:02:32 -0400 (EDT) Received: by mail-wr0-f198.google.com with SMTP id v31so36835742wrc.7 for ; Thu, 27 Jul 2017 17:02:31 -0700 (PDT) Received: from mail.linuxfoundation.org (mail.linuxfoundation.org. [140.211.169.12]) by mx.google.com with ESMTPS id e13si4123457wmc.10.2017.07.27.16.53.50 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Thu, 27 Jul 2017 16:53:51 -0700 (PDT) Date: Thu, 27 Jul 2017 16:53:48 -0700 From: Andrew Morton Subject: Re: [PATCH 1/2] mm: use sc->priority for slab shrink targets Message-Id: <20170727165348.0e23487a9f98c359fbd5bfea@linux-foundation.org> In-Reply-To: <1500576331-31214-2-git-send-email-jbacik@fb.com> References: <1500576331-31214-1-git-send-email-jbacik@fb.com> <1500576331-31214-2-git-send-email-jbacik@fb.com> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: josef@toxicpanda.com Cc: minchan@kernel.org, linux-mm@kvack.org, hannes@cmpxchg.org, riel@redhat.com, david@fromorbit.com, kernel-team@fb.com, Josef Bacik On Thu, 20 Jul 2017 14:45:30 -0400 josef@toxicpanda.com wrote: > From: Josef Bacik > > Previously we were using the ratio of the number of lru pages scanned to > the number of eligible lru pages to determine the number of slab objects > to scan. The problem with this is that these two things have nothing to > do with each other, "nothing"? > so in slab heavy work loads where there is little to > no page cache we can end up with the pages scanned being a very low > number. In this case the "number of eligible lru pages" will also be low, so these things do have something to do with each other? > This means that we reclaim next to no slab pages and waste a > lot of time reclaiming small amounts of space. > > Instead use sc->priority in the same way we use it to determine scan > amounts for the lru's. That sounds like a good idea. 
Alternatively did you consider hooking into the vmpressure code (or hannes's new memdelay code) to determine how hard to scan slab? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-yw0-f199.google.com (mail-yw0-f199.google.com [209.85.161.199]) by kanga.kvack.org (Postfix) with ESMTP id CE9506B0579 for ; Fri, 28 Jul 2017 19:52:52 -0400 (EDT) Received: by mail-yw0-f199.google.com with SMTP id t139so324615711ywg.6 for ; Fri, 28 Jul 2017 16:52:52 -0700 (PDT) Received: from mail-yw0-x244.google.com (mail-yw0-x244.google.com. [2607:f8b0:4002:c05::244]) by mx.google.com with ESMTPS id p17si5360917ybd.419.2017.07.28.16.52.51 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Fri, 28 Jul 2017 16:52:51 -0700 (PDT) Received: by mail-yw0-x244.google.com with SMTP id u207so9183434ywc.0 for ; Fri, 28 Jul 2017 16:52:51 -0700 (PDT) Date: Fri, 28 Jul 2017 23:52:50 +0000 From: Josef Bacik Subject: Re: [PATCH 1/2] mm: use sc->priority for slab shrink targets Message-ID: <20170728235248.GA27897@li70-116.members.linode.com> References: <1500576331-31214-1-git-send-email-jbacik@fb.com> <1500576331-31214-2-git-send-email-jbacik@fb.com> <20170727165348.0e23487a9f98c359fbd5bfea@linux-foundation.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20170727165348.0e23487a9f98c359fbd5bfea@linux-foundation.org> Sender: owner-linux-mm@kvack.org List-ID: To: Andrew Morton Cc: josef@toxicpanda.com, minchan@kernel.org, linux-mm@kvack.org, hannes@cmpxchg.org, riel@redhat.com, david@fromorbit.com, kernel-team@fb.com, Josef Bacik On Thu, Jul 27, 2017 at 04:53:48PM -0700, Andrew Morton wrote: > On Thu, 20 Jul 2017 14:45:30 -0400 josef@toxicpanda.com wrote: > > > From: Josef Bacik > > > > Previously we were using the ratio of the number of lru pages scanned to > > the number of eligible lru pages to determine the number of slab objects > > to scan. The problem with this is that these two things have nothing to > > do with each other, > > "nothing"? > > > so in slab heavy work loads where there is little to > > no page cache we can end up with the pages scanned being a very low > > number. > > In this case the "number of eligible lru pages" will also be low, so > these things do have something to do with each other? > The problem is scanned doesn't correlate to the scanned count we calculate, but rather the pages we're able to actually scan. With almost no page cache we end up with really low scanned counts to "relatively" high lru count, which makes the ratio really really low. Anecdotally we would have 10 million inodes in cache, but the ratios were such that our scan target was like 8k. > > This means that we reclaim next to no slab pages and waste a > > lot of time reclaiming small amounts of space. > > > > Instead use sc->priority in the same way we use it to determine scan > > amounts for the lru's. > > That sounds like a good idea. > > Alternatively did you consider hooking into the vmpressure code (or > hannes's new memdelay code) to determine how hard to scan slab? > Vmpressure requires memcg to be turned on. As for memdelay that might be a good direction in the future, but right now it's just per task. We could probably use it for direct reclaim, but I really want this to make kswapd better so we avoid direct reclaim. 
If it's expanded to be system wide so we could have an idea of the effect of memory reclaim on the whole system that would tie in nicely here. But for now I think staying consistent with everything else is good enough. Thanks, Josef -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qt0-f198.google.com (mail-qt0-f198.google.com [209.85.216.198]) by kanga.kvack.org (Postfix) with ESMTP id 6FC2328071E for ; Tue, 22 Aug 2017 15:35:42 -0400 (EDT) Received: by mail-qt0-f198.google.com with SMTP id 57so34523309qtu.4 for ; Tue, 22 Aug 2017 12:35:42 -0700 (PDT) Received: from mail-qt0-x243.google.com (mail-qt0-x243.google.com. [2607:f8b0:400d:c0d::243]) by mx.google.com with ESMTPS id z66si14854623qkb.286.2017.08.22.12.35.41 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 22 Aug 2017 12:35:41 -0700 (PDT) Received: by mail-qt0-x243.google.com with SMTP id 57so8189646qtu.0 for ; Tue, 22 Aug 2017 12:35:41 -0700 (PDT) From: josef@toxicpanda.com Subject: [PATCH 1/2] mm: use sc->priority for slab shrink targets Date: Tue, 22 Aug 2017 15:35:38 -0400 Message-Id: <1503430539-2878-1-git-send-email-jbacik@fb.com> Sender: owner-linux-mm@kvack.org List-ID: To: minchan@kernel.org, linux-mm@kvack.org, hannes@cmpxchg.org, riel@redhat.com, akpm@linux-foundation.org, david@fromorbit.com, kernel-team@fb.com, aryabinin@virtuozzo.com Cc: Josef Bacik From: Josef Bacik Previously we were using the ratio of the number of lru pages scanned to the number of eligible lru pages to determine the number of slab objects to scan. The problem with this is that these two things have nothing to do with each other, so in slab heavy work loads where there is little to no page cache we can end up with the pages scanned being a very low number. This means that we reclaim next to no slab pages and waste a lot of time reclaiming small amounts of space. Instead use sc->priority in the same way we use it to determine scan amounts for the lru's. This generally equates to pages. Consider the following slab_pages = (nr_objects * object_size) / PAGE_SIZE What we would like to do is scan = slab_pages >> sc->priority but we don't know the number of slab pages each shrinker controls, only the objects. However say that theoretically we knew how many pages a shrinker controlled, we'd still have to convert this to objects, which would look like the following scan = shrinker_pages >> sc->priority scan_objects = (PAGE_SIZE / object_size) * scan or written another way scan_objects = (shrinker_pages >> sc->priority) * (PAGE_SIZE / object_size) which can thus be written scan_objects = ((shrinker_pages * PAGE_SIZE) / object_size) >> sc->priority which is just scan_objects = nr_objects >> sc->priority We don't need to know exactly how many pages each shrinker represents, it's objects are all the information we need. Making this change allows us to place an appropriate amount of pressure on the shrinker pools for their relative size. 
Signed-off-by: Josef Bacik --- include/trace/events/vmscan.h | 23 ++++++++++------------ mm/vmscan.c | 46 +++++++++++-------------------------------- 2 files changed, 22 insertions(+), 47 deletions(-) diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h index 27e8a5c..8c5a00a 100644 --- a/include/trace/events/vmscan.h +++ b/include/trace/events/vmscan.h @@ -187,12 +187,12 @@ DEFINE_EVENT(mm_vmscan_direct_reclaim_end_template, mm_vmscan_memcg_softlimit_re TRACE_EVENT(mm_shrink_slab_start, TP_PROTO(struct shrinker *shr, struct shrink_control *sc, - long nr_objects_to_shrink, unsigned long pgs_scanned, - unsigned long lru_pgs, unsigned long cache_items, - unsigned long long delta, unsigned long total_scan), + long nr_objects_to_shrink, unsigned long cache_items, + unsigned long long delta, unsigned long total_scan, + int priority), - TP_ARGS(shr, sc, nr_objects_to_shrink, pgs_scanned, lru_pgs, - cache_items, delta, total_scan), + TP_ARGS(shr, sc, nr_objects_to_shrink, cache_items, delta, total_scan, + priority), TP_STRUCT__entry( __field(struct shrinker *, shr) @@ -200,11 +200,10 @@ TRACE_EVENT(mm_shrink_slab_start, __field(int, nid) __field(long, nr_objects_to_shrink) __field(gfp_t, gfp_flags) - __field(unsigned long, pgs_scanned) - __field(unsigned long, lru_pgs) __field(unsigned long, cache_items) __field(unsigned long long, delta) __field(unsigned long, total_scan) + __field(int, priority) ), TP_fast_assign( @@ -213,24 +212,22 @@ TRACE_EVENT(mm_shrink_slab_start, __entry->nid = sc->nid; __entry->nr_objects_to_shrink = nr_objects_to_shrink; __entry->gfp_flags = sc->gfp_mask; - __entry->pgs_scanned = pgs_scanned; - __entry->lru_pgs = lru_pgs; __entry->cache_items = cache_items; __entry->delta = delta; __entry->total_scan = total_scan; + __entry->priority = priority; ), - TP_printk("%pF %p: nid: %d objects to shrink %ld gfp_flags %s pgs_scanned %ld lru_pgs %ld cache items %ld delta %lld total_scan %ld", + TP_printk("%pF %p: nid: %d objects to shrink %ld gfp_flags %s cache items %ld delta %lld total_scan %ld priority %d", __entry->shrink, __entry->shr, __entry->nid, __entry->nr_objects_to_shrink, show_gfp_flags(__entry->gfp_flags), - __entry->pgs_scanned, - __entry->lru_pgs, __entry->cache_items, __entry->delta, - __entry->total_scan) + __entry->total_scan, + __entry->priority) ); TRACE_EVENT(mm_shrink_slab_end, diff --git a/mm/vmscan.c b/mm/vmscan.c index 734e8d3..608dfe6 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -306,9 +306,7 @@ EXPORT_SYMBOL(unregister_shrinker); #define SHRINK_BATCH 128 static unsigned long do_shrink_slab(struct shrink_control *shrinkctl, - struct shrinker *shrinker, - unsigned long nr_scanned, - unsigned long nr_eligible) + struct shrinker *shrinker, int priority) { unsigned long freed = 0; unsigned long long delta; @@ -333,9 +331,8 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl, nr = atomic_long_xchg(&shrinker->nr_deferred[nid], 0); total_scan = nr; - delta = (4 * nr_scanned) / shrinker->seeks; - delta *= freeable; - do_div(delta, nr_eligible + 1); + delta = freeable >> priority; + delta = (4 * freeable) / shrinker->seeks; total_scan += delta; if (total_scan < 0) { pr_err("shrink_slab: %pF negative objects to delete nr=%ld\n", @@ -369,8 +366,7 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl, total_scan = freeable * 2; trace_mm_shrink_slab_start(shrinker, shrinkctl, nr, - nr_scanned, nr_eligible, - freeable, delta, total_scan); + freeable, delta, total_scan, priority); /* * Normally, we should not 
scan less than batch_size objects in one @@ -429,8 +425,7 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl, * @gfp_mask: allocation context * @nid: node whose slab caches to target * @memcg: memory cgroup whose slab caches to target - * @nr_scanned: pressure numerator - * @nr_eligible: pressure denominator + * @priority: the reclaim priority * * Call the shrink functions to age shrinkable caches. * @@ -442,20 +437,14 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl, * objects from the memory cgroup specified. Otherwise, only unaware * shrinkers are called. * - * @nr_scanned and @nr_eligible form a ratio that indicate how much of - * the available objects should be scanned. Page reclaim for example - * passes the number of pages scanned and the number of pages on the - * LRU lists that it considered on @nid, plus a bias in @nr_scanned - * when it encountered mapped pages. The ratio is further biased by - * the ->seeks setting of the shrink function, which indicates the - * cost to recreate an object relative to that of an LRU page. + * @priority is sc->priority, we take the number of objects and >> by priority + * in order to get the scan target. * * Returns the number of reclaimed slab objects. */ static unsigned long shrink_slab(gfp_t gfp_mask, int nid, struct mem_cgroup *memcg, - unsigned long nr_scanned, - unsigned long nr_eligible) + int priority) { struct shrinker *shrinker; unsigned long freed = 0; @@ -463,9 +452,6 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid, if (memcg && (!memcg_kmem_enabled() || !mem_cgroup_online(memcg))) return 0; - if (nr_scanned == 0) - nr_scanned = SWAP_CLUSTER_MAX; - if (!down_read_trylock(&shrinker_rwsem)) { /* * If we would return 0, our callers would understand that we @@ -496,7 +482,7 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid, if (!(shrinker->flags & SHRINKER_NUMA_AWARE)) sc.nid = 0; - freed += do_shrink_slab(&sc, shrinker, nr_scanned, nr_eligible); + freed += do_shrink_slab(&sc, shrinker, priority); } up_read(&shrinker_rwsem); @@ -514,8 +500,7 @@ void drop_slab_node(int nid) freed = 0; do { - freed += shrink_slab(GFP_KERNEL, nid, memcg, - 1000, 1000); + freed += shrink_slab(GFP_KERNEL, nid, memcg, 0); } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL); } while (freed > 10); } @@ -2610,14 +2595,12 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc) reclaimed = sc->nr_reclaimed; scanned = sc->nr_scanned; - shrink_node_memcg(pgdat, memcg, sc, &lru_pages); node_lru_pages += lru_pages; if (memcg) shrink_slab(sc->gfp_mask, pgdat->node_id, - memcg, sc->nr_scanned - scanned, - lru_pages); + memcg, sc->priority); /* Record the group's reclaim efficiency */ vmpressure(sc->gfp_mask, memcg, false, @@ -2641,14 +2624,9 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc) } } while ((memcg = mem_cgroup_iter(root, memcg, &reclaim))); - /* - * Shrink the slab caches in the same proportion that - * the eligible LRU pages were scanned. - */ if (global_reclaim(sc)) shrink_slab(sc->gfp_mask, pgdat->node_id, NULL, - sc->nr_scanned - nr_scanned, - node_lru_pages); + sc->priority); /* * Record the subtree's reclaim efficiency. The reclaimed -- 2.7.4 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pg0-f70.google.com (mail-pg0-f70.google.com [74.125.83.70]) by kanga.kvack.org (Postfix) with ESMTP id 74480280704 for ; Thu, 24 Aug 2017 03:08:14 -0400 (EDT) Received: by mail-pg0-f70.google.com with SMTP id m133so32962649pga.2 for ; Thu, 24 Aug 2017 00:08:14 -0700 (PDT) Received: from mail-pg0-x243.google.com (mail-pg0-x243.google.com. [2607:f8b0:400e:c05::243]) by mx.google.com with ESMTPS id a41si106919pli.340.2017.08.24.00.08.13 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Thu, 24 Aug 2017 00:08:13 -0700 (PDT) Received: by mail-pg0-x243.google.com with SMTP id p14so2686472pgd.1 for ; Thu, 24 Aug 2017 00:08:13 -0700 (PDT) Date: Thu, 24 Aug 2017 16:08:01 +0900 From: Minchan Kim Subject: Re: [PATCH 1/2] mm: use sc->priority for slab shrink targets Message-ID: <20170824070801.GA20463@bgram> References: <1503430539-2878-1-git-send-email-jbacik@fb.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1503430539-2878-1-git-send-email-jbacik@fb.com> Sender: owner-linux-mm@kvack.org List-ID: To: josef@toxicpanda.com Cc: linux-mm@kvack.org, hannes@cmpxchg.org, riel@redhat.com, akpm@linux-foundation.org, david@fromorbit.com, kernel-team@fb.com, aryabinin@virtuozzo.com, Josef Bacik On Tue, Aug 22, 2017 at 03:35:38PM -0400, josef@toxicpanda.com wrote: > From: Josef Bacik > > Previously we were using the ratio of the number of lru pages scanned to > the number of eligible lru pages to determine the number of slab objects > to scan. The problem with this is that these two things have nothing to > do with each other, so in slab heavy work loads where there is little to > no page cache we can end up with the pages scanned being a very low > number. This means that we reclaim next to no slab pages and waste a > lot of time reclaiming small amounts of space. Your answer on the question I asked will help to parse this paragraph. Quote from previous discussion: " where sc->priority starts at DEF_PRIORITY, which is 12. The first loop through reclaim would result in a scan target of 2 pages to 11715 total inactive pages, and 3 pages to 14710 total active pages. This is a really really small target for a system that is entirely slab pages. And this is super optimistic, this assumes we even get to scan these pages. We don't increment sc->nr_scanned unless we 1) isolate the page, which assumes it's not in use, and 2) can lock the page. Under pressure these numbers could probably go down, I'm sure there's some random pages from daemons that aren't actually in use, so the targets get even smaller. " Please add it to the description. > > Instead use sc->priority in the same way we use it to determine scan > amounts for the lru's. This generally equates to pages. Consider the > following > > slab_pages = (nr_objects * object_size) / PAGE_SIZE > > What we would like to do is > > scan = slab_pages >> sc->priority > > but we don't know the number of slab pages each shrinker controls, only > the objects. 
However say that theoretically we knew how many pages a > shrinker controlled, we'd still have to convert this to objects, which > would look like the following > > scan = shrinker_pages >> sc->priority > scan_objects = (PAGE_SIZE / object_size) * scan > > or written another way > > scan_objects = (shrinker_pages >> sc->priority) * > (PAGE_SIZE / object_size) > > which can thus be written > > scan_objects = ((shrinker_pages * PAGE_SIZE) / object_size) >> > sc->priority > > which is just > > scan_objects = nr_objects >> sc->priority > > We don't need to know exactly how many pages each shrinker represents, > it's objects are all the information we need. Making this change allows > us to place an appropriate amount of pressure on the shrinker pools for > their relative size. > > Signed-off-by: Josef Bacik Acked-by: Minchan Kim -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pg0-f72.google.com (mail-pg0-f72.google.com [74.125.83.72]) by kanga.kvack.org (Postfix) with ESMTP id 822A72808A4 for ; Thu, 24 Aug 2017 10:27:23 -0400 (EDT) Received: by mail-pg0-f72.google.com with SMTP id m7so3731878pga.8 for ; Thu, 24 Aug 2017 07:27:23 -0700 (PDT) Received: from EUR03-AM5-obe.outbound.protection.outlook.com (mail-eopbgr30131.outbound.protection.outlook.com. [40.107.3.131]) by mx.google.com with ESMTPS id y94si3049213plh.826.2017.08.24.07.27.22 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-SHA bits=128/128); Thu, 24 Aug 2017 07:27:22 -0700 (PDT) Subject: Re: [PATCH 1/2] mm: use sc->priority for slab shrink targets References: <1503430539-2878-1-git-send-email-jbacik@fb.com> From: Andrey Ryabinin Message-ID: Date: Thu, 24 Aug 2017 17:29:59 +0300 MIME-Version: 1.0 In-Reply-To: <1503430539-2878-1-git-send-email-jbacik@fb.com> Content-Type: text/plain; charset=utf-8 Content-Language: en-US Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: josef@toxicpanda.com, minchan@kernel.org, linux-mm@kvack.org, hannes@cmpxchg.org, riel@redhat.com, akpm@linux-foundation.org, david@fromorbit.com, kernel-team@fb.com Cc: Josef Bacik On 08/22/2017 10:35 PM, josef@toxicpanda.com wrote: > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -306,9 +306,7 @@ EXPORT_SYMBOL(unregister_shrinker); > #define SHRINK_BATCH 128 > > static unsigned long do_shrink_slab(struct shrink_control *shrinkctl, > - struct shrinker *shrinker, > - unsigned long nr_scanned, > - unsigned long nr_eligible) > + struct shrinker *shrinker, int priority) > { > unsigned long freed = 0; > unsigned long long delta; > @@ -333,9 +331,8 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl, > nr = atomic_long_xchg(&shrinker->nr_deferred[nid], 0); > > total_scan = nr; > - delta = (4 * nr_scanned) / shrinker->seeks; > - delta *= freeable; > - do_div(delta, nr_eligible + 1); > + delta = freeable >> priority; > + delta = (4 * freeable) / shrinker->seeks; Something is wrong. The first line does nothing. > total_scan += delta; > if (total_scan < 0) { > pr_err("shrink_slab: %pF negative objects to delete nr=%ld\n", -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-yw0-f197.google.com (mail-yw0-f197.google.com [209.85.161.197]) by kanga.kvack.org (Postfix) with ESMTP id E3B7C6B04BF for ; Thu, 24 Aug 2017 10:49:28 -0400 (EDT) Received: by mail-yw0-f197.google.com with SMTP id p6so3855677ywh.8 for ; Thu, 24 Aug 2017 07:49:28 -0700 (PDT) Received: from mail-qk0-x241.google.com (mail-qk0-x241.google.com. [2607:f8b0:400d:c09::241]) by mx.google.com with ESMTPS id e130si1103821ywc.458.2017.08.24.07.49.27 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Thu, 24 Aug 2017 07:49:27 -0700 (PDT) Received: by mail-qk0-x241.google.com with SMTP id s3so2429060qkd.1 for ; Thu, 24 Aug 2017 07:49:27 -0700 (PDT) Date: Thu, 24 Aug 2017 10:49:25 -0400 From: Josef Bacik Subject: Re: [PATCH 1/2] mm: use sc->priority for slab shrink targets Message-ID: <20170824144924.w3inhdnmgfscso7l@destiny> References: <1503430539-2878-1-git-send-email-jbacik@fb.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org List-ID: To: Andrey Ryabinin Cc: josef@toxicpanda.com, minchan@kernel.org, linux-mm@kvack.org, hannes@cmpxchg.org, riel@redhat.com, akpm@linux-foundation.org, david@fromorbit.com, kernel-team@fb.com, Josef Bacik On Thu, Aug 24, 2017 at 05:29:59PM +0300, Andrey Ryabinin wrote: > > > On 08/22/2017 10:35 PM, josef@toxicpanda.com wrote: > > --- a/mm/vmscan.c > > +++ b/mm/vmscan.c > > @@ -306,9 +306,7 @@ EXPORT_SYMBOL(unregister_shrinker); > > #define SHRINK_BATCH 128 > > > > static unsigned long do_shrink_slab(struct shrink_control *shrinkctl, > > - struct shrinker *shrinker, > > - unsigned long nr_scanned, > > - unsigned long nr_eligible) > > + struct shrinker *shrinker, int priority) > > { > > unsigned long freed = 0; > > unsigned long long delta; > > @@ -333,9 +331,8 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl, > > nr = atomic_long_xchg(&shrinker->nr_deferred[nid], 0); > > > > total_scan = nr; > > - delta = (4 * nr_scanned) / shrinker->seeks; > > - delta *= freeable; > > - do_div(delta, nr_eligible + 1); > > + delta = freeable >> priority; > > + delta = (4 * freeable) / shrinker->seeks; > > Something is wrong. The first line does nothing. > Lol jesus, nice catch, I'll fix this up. Thanks, Josef -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pg0-f72.google.com (mail-pg0-f72.google.com [74.125.83.72]) by kanga.kvack.org (Postfix) with ESMTP id CE148440882 for ; Thu, 24 Aug 2017 18:16:04 -0400 (EDT) Received: by mail-pg0-f72.google.com with SMTP id t193so2802322pgc.4 for ; Thu, 24 Aug 2017 15:16:04 -0700 (PDT) Received: from ipmail01.adl2.internode.on.net (ipmail01.adl2.internode.on.net. 
[150.101.137.133]) by mx.google.com with ESMTP id 33si3707393plg.445.2017.08.24.15.16.02 for ; Thu, 24 Aug 2017 15:16:03 -0700 (PDT) Date: Fri, 25 Aug 2017 08:15:59 +1000 From: Dave Chinner Subject: Re: [PATCH 1/2] mm: use sc->priority for slab shrink targets Message-ID: <20170824221559.GF21024@dastard> References: <1503430539-2878-1-git-send-email-jbacik@fb.com> <20170824144924.w3inhdnmgfscso7l@destiny> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20170824144924.w3inhdnmgfscso7l@destiny> Sender: owner-linux-mm@kvack.org List-ID: To: Josef Bacik Cc: Andrey Ryabinin , minchan@kernel.org, linux-mm@kvack.org, hannes@cmpxchg.org, riel@redhat.com, akpm@linux-foundation.org, kernel-team@fb.com, Josef Bacik On Thu, Aug 24, 2017 at 10:49:25AM -0400, Josef Bacik wrote: > On Thu, Aug 24, 2017 at 05:29:59PM +0300, Andrey Ryabinin wrote: > > > > > > On 08/22/2017 10:35 PM, josef@toxicpanda.com wrote: > > > --- a/mm/vmscan.c > > > +++ b/mm/vmscan.c > > > @@ -306,9 +306,7 @@ EXPORT_SYMBOL(unregister_shrinker); > > > #define SHRINK_BATCH 128 > > > > > > static unsigned long do_shrink_slab(struct shrink_control *shrinkctl, > > > - struct shrinker *shrinker, > > > - unsigned long nr_scanned, > > > - unsigned long nr_eligible) > > > + struct shrinker *shrinker, int priority) > > > { > > > unsigned long freed = 0; > > > unsigned long long delta; > > > @@ -333,9 +331,8 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl, > > > nr = atomic_long_xchg(&shrinker->nr_deferred[nid], 0); > > > > > > total_scan = nr; > > > - delta = (4 * nr_scanned) / shrinker->seeks; > > > - delta *= freeable; > > > - do_div(delta, nr_eligible + 1); > > > + delta = freeable >> priority; > > > + delta = (4 * freeable) / shrinker->seeks; > > > > Something is wrong. The first line does nothing. > > > > Lol jesus, nice catch, I'll fix this up. Thanks, Josef, this bug has been in every patch you've sent. What does fixing it do to the behaviour of the algorithm now? It's going to change it, for sure, so can you run all your behavioural characterisation tests and let us know what the difference between the broken and fixed patches are? Cheers, Dave. -- Dave Chinner david@fromorbit.com -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qk0-f199.google.com (mail-qk0-f199.google.com [209.85.220.199]) by kanga.kvack.org (Postfix) with ESMTP id 00A17440882 for ; Thu, 24 Aug 2017 18:45:49 -0400 (EDT) Received: by mail-qk0-f199.google.com with SMTP id o184so4205480qkc.6 for ; Thu, 24 Aug 2017 15:45:48 -0700 (PDT) Received: from mail-qt0-x242.google.com (mail-qt0-x242.google.com. 
[2607:f8b0:400d:c0d::242]) by mx.google.com with ESMTPS id g11si3257275qtc.127.2017.08.24.15.45.47 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Thu, 24 Aug 2017 15:45:47 -0700 (PDT) Received: by mail-qt0-x242.google.com with SMTP id e2so393769qta.3 for ; Thu, 24 Aug 2017 15:45:47 -0700 (PDT) Date: Thu, 24 Aug 2017 18:45:46 -0400 From: Josef Bacik Subject: Re: [PATCH 1/2] mm: use sc->priority for slab shrink targets Message-ID: <20170824224544.3yrsl36u536fgkhc@destiny> References: <1503430539-2878-1-git-send-email-jbacik@fb.com> <20170824144924.w3inhdnmgfscso7l@destiny> <20170824221559.GF21024@dastard> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20170824221559.GF21024@dastard> Sender: owner-linux-mm@kvack.org List-ID: To: Dave Chinner Cc: Josef Bacik , Andrey Ryabinin , minchan@kernel.org, linux-mm@kvack.org, hannes@cmpxchg.org, riel@redhat.com, akpm@linux-foundation.org, kernel-team@fb.com, Josef Bacik On Fri, Aug 25, 2017 at 08:15:59AM +1000, Dave Chinner wrote: > On Thu, Aug 24, 2017 at 10:49:25AM -0400, Josef Bacik wrote: > > On Thu, Aug 24, 2017 at 05:29:59PM +0300, Andrey Ryabinin wrote: > > > > > > > > > On 08/22/2017 10:35 PM, josef@toxicpanda.com wrote: > > > > --- a/mm/vmscan.c > > > > +++ b/mm/vmscan.c > > > > @@ -306,9 +306,7 @@ EXPORT_SYMBOL(unregister_shrinker); > > > > #define SHRINK_BATCH 128 > > > > > > > > static unsigned long do_shrink_slab(struct shrink_control *shrinkctl, > > > > - struct shrinker *shrinker, > > > > - unsigned long nr_scanned, > > > > - unsigned long nr_eligible) > > > > + struct shrinker *shrinker, int priority) > > > > { > > > > unsigned long freed = 0; > > > > unsigned long long delta; > > > > @@ -333,9 +331,8 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl, > > > > nr = atomic_long_xchg(&shrinker->nr_deferred[nid], 0); > > > > > > > > total_scan = nr; > > > > - delta = (4 * nr_scanned) / shrinker->seeks; > > > > - delta *= freeable; > > > > - do_div(delta, nr_eligible + 1); > > > > + delta = freeable >> priority; > > > > + delta = (4 * freeable) / shrinker->seeks; > > > > > > Something is wrong. The first line does nothing. > > > > > > > Lol jesus, nice catch, I'll fix this up. Thanks, > > Josef, this bug has been in every patch you've sent. What does > fixing it do to the behaviour of the algorithm now? It's going to > change it, for sure, so can you run all your behavioural > characterisation tests and let us know what the difference between > the broken and fixed patches are? The bug made it so we were way more aggressive with slab reclaim than we should be. My second patch masked this with the inactive/slab diff stuff, but I've dropped that patch since it's controversial and I don't really care to argue about it anymore. This patch still fixes the issue of us not reclaiming enough in slab-mostly workloads; I ran my fs_mark test before I sent out the new version to verify there still isn't the huge drop in performance once reclaim kicks in. Without any changes we reclaimed basically no slab, with the bug in place (without my second patch) we reclaimed all of the slab in one go, and with the fixed patch we reclaim a proportional amount each time we enter the shrinker. Thanks, Josef -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pg0-f69.google.com (mail-pg0-f69.google.com [74.125.83.69]) by kanga.kvack.org (Postfix) with ESMTP id 8F74844088B for ; Thu, 24 Aug 2017 21:43:37 -0400 (EDT) Received: by mail-pg0-f69.google.com with SMTP id q68so4319687pgq.11 for ; Thu, 24 Aug 2017 18:43:37 -0700 (PDT) Received: from ipmail01.adl2.internode.on.net (ipmail01.adl2.internode.on.net. [150.101.137.133]) by mx.google.com with ESMTP id u5si3979318plm.505.2017.08.24.18.43.33 for ; Thu, 24 Aug 2017 18:43:35 -0700 (PDT) Date: Fri, 25 Aug 2017 11:40:09 +1000 From: Dave Chinner Subject: Re: [PATCH 1/2] mm: use sc->priority for slab shrink targets Message-ID: <20170825014009.GL21024@dastard> References: <1503430539-2878-1-git-send-email-jbacik@fb.com> <20170824144924.w3inhdnmgfscso7l@destiny> <20170824221559.GF21024@dastard> <20170824224544.3yrsl36u536fgkhc@destiny> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20170824224544.3yrsl36u536fgkhc@destiny> Sender: owner-linux-mm@kvack.org List-ID: To: Josef Bacik Cc: Andrey Ryabinin , minchan@kernel.org, linux-mm@kvack.org, hannes@cmpxchg.org, riel@redhat.com, akpm@linux-foundation.org, kernel-team@fb.com, Josef Bacik On Thu, Aug 24, 2017 at 06:45:46PM -0400, Josef Bacik wrote: > On Fri, Aug 25, 2017 at 08:15:59AM +1000, Dave Chinner wrote: > > On Thu, Aug 24, 2017 at 10:49:25AM -0400, Josef Bacik wrote: > > > On Thu, Aug 24, 2017 at 05:29:59PM +0300, Andrey Ryabinin wrote: > > > > > > > > > > > > On 08/22/2017 10:35 PM, josef@toxicpanda.com wrote: > > > > > --- a/mm/vmscan.c > > > > > +++ b/mm/vmscan.c > > > > > @@ -306,9 +306,7 @@ EXPORT_SYMBOL(unregister_shrinker); > > > > > #define SHRINK_BATCH 128 > > > > > > > > > > static unsigned long do_shrink_slab(struct shrink_control *shrinkctl, > > > > > - struct shrinker *shrinker, > > > > > - unsigned long nr_scanned, > > > > > - unsigned long nr_eligible) > > > > > + struct shrinker *shrinker, int priority) > > > > > { > > > > > unsigned long freed = 0; > > > > > unsigned long long delta; > > > > > @@ -333,9 +331,8 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl, > > > > > nr = atomic_long_xchg(&shrinker->nr_deferred[nid], 0); > > > > > > > > > > total_scan = nr; > > > > > - delta = (4 * nr_scanned) / shrinker->seeks; > > > > > - delta *= freeable; > > > > > - do_div(delta, nr_eligible + 1); > > > > > + delta = freeable >> priority; > > > > > + delta = (4 * freeable) / shrinker->seeks; > > > > > > > > Something is wrong. The first line does nothing. > > > > > > > > > > Lol jesus, nice catch, I'll fix this up. Thanks, > > > > Josef, this bug has been in every patch you've sent. What does > > fixing it do to the behaviour of the algorithm now? It's going to > > change it, for sure, so can you run all your behavioural > > characterisation tests and let us know what the difference between > > the broken and fixed patches are? > > The bug made it so we were way more agressive with slab reclaim than we should > be. My second patch masked this with the inactive/slab diff stuff, but I've > dropped that patch since its controversial and I don't really care to argue > about it anymore. This patch still fixes the issue of us not reclaiming enough > in slab mostly workloads, I ran my fs_mark test before I sent out the new > version to verify there still isn't the huge drop in performance once reclaim > kicks in. 
Without any changes we reclaimed basically no slab, with the bug in > place (without my second patch) we reclaimed all of the slab in one go, and with > the fixed patch we reclaim a proportional amount each time we enter the > shrinker. Thanks, Cool, thanks for verifying it works as intended now, Josef. :P Cheers, Dave. -- Dave Chinner david@fromorbit.com -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org
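
For readers following the fix itself: the dead store Andrey points out is the first of the two added lines in do_shrink_slab(). The assignment "delta = freeable >> priority;" is immediately overwritten by "delta = (4 * freeable) / shrinker->seeks;", so the priority shift never takes effect and the scan target is always derived from the full freeable count, which is why the broken version reclaimed everything in one go. A minimal sketch of the intended computation, assuming the fix keeps the priority shift and then applies the pre-existing seeks scaling on top of it (a plausible reconstruction, not necessarily the exact hunk from the reposted patch):

	total_scan = nr;
	delta = freeable >> priority;	/* scan target scales with reclaim priority */
	delta *= 4;			/* same fixed-point factor the old delta calculation used */
	do_div(delta, shrinker->seeks);	/* objects that are costlier to recreate (higher seeks) get less pressure */

With that ordering, delta scales as freeable >> sc->priority, matching how the per-LRU scan targets are computed, and shrinker->seeks then adjusts it, instead of delta always being proportional to the whole cache.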
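
To make "we reclaim a proportional amount each time we enter the shrinker" concrete: with the fix, the per-pass scan target is just freeable >> sc->priority, the same proportionality the page LRUs already use. The small userspace sketch below is illustrative only; the one-million-object cache is an arbitrary example value, and 12 mirrors the kernel's DEF_PRIORITY, the priority reclaim starts from before ratcheting down toward 0 under sustained pressure.

	#include <stdio.h>

	int main(void)
	{
		unsigned long freeable = 1UL << 20;	/* example: ~1M objects in a shrinker's cache */
		int priority;

		/* Reclaim starts at DEF_PRIORITY (12) and walks toward 0 as pressure mounts. */
		for (priority = 12; priority >= 0; priority--)
			printf("priority %2d -> scan target %lu objects\n",
			       priority, freeable >> priority);
		return 0;
	}

At priority 12 only 256 of the million objects are targeted per pass; by priority 0 the entire cache is eligible. That is the behaviour described above: neither "basically no slab" nor "all of the slab in one go", but a slice that grows as memory pressure does.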