* [PATCH v3] memcg: add memory.vmscan_stat
@ 2011-07-22  8:15 ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 54+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-07-22 8:15 UTC (permalink / raw)
To: linux-mm; +Cc: linux-kernel, nishimura, Michal Hocko, akpm, abrestic

[PATCH] add memory.vmscan_stat

The commit log of commit 0ae5e89 "memcg: count the soft_limit reclaim in..."
says it adds scanning stats to the memory.stat file. It does not, because we
considered we needed to reach a consensus for such new APIs first.

This patch is a trial to add memory.vmscan_stat. It shows
  - the number of scanned pages (total, anon, file)
  - the number of rotated pages (total, anon, file)
  - the number of freed pages (total, anon, file)
  - the elapsed time (including sleep/pause time)
for both direct and soft reclaim.

The biggest difference from Ying's original version is that this file can be
reset by a write, as in

  # echo 0 > ...../memory.vmscan_stat

Here is an example of the output. This is the result after "make -j 6" on the
kernel under a 300M limit.
[kamezawa@bluextal ~]$ cat /cgroup/memory/A/memory.vmscan_stat
scanned_pages_by_limit 9471864
scanned_anon_pages_by_limit 6640629
scanned_file_pages_by_limit 2831235
rotated_pages_by_limit 4243974
rotated_anon_pages_by_limit 3971968
rotated_file_pages_by_limit 272006
freed_pages_by_limit 2318492
freed_anon_pages_by_limit 962052
freed_file_pages_by_limit 1356440
elapsed_ns_by_limit 351386416101
scanned_pages_by_system 0
scanned_anon_pages_by_system 0
scanned_file_pages_by_system 0
rotated_pages_by_system 0
rotated_anon_pages_by_system 0
rotated_file_pages_by_system 0
freed_pages_by_system 0
freed_anon_pages_by_system 0
freed_file_pages_by_system 0
elapsed_ns_by_system 0
scanned_pages_by_limit_under_hierarchy 9471864
scanned_anon_pages_by_limit_under_hierarchy 6640629
scanned_file_pages_by_limit_under_hierarchy 2831235
rotated_pages_by_limit_under_hierarchy 4243974
rotated_anon_pages_by_limit_under_hierarchy 3971968
rotated_file_pages_by_limit_under_hierarchy 272006
freed_pages_by_limit_under_hierarchy 2318492
freed_anon_pages_by_limit_under_hierarchy 962052
freed_file_pages_by_limit_under_hierarchy 1356440
elapsed_ns_by_limit_under_hierarchy 351386416101
scanned_pages_by_system_under_hierarchy 0
scanned_anon_pages_by_system_under_hierarchy 0
scanned_file_pages_by_system_under_hierarchy 0
rotated_pages_by_system_under_hierarchy 0
rotated_anon_pages_by_system_under_hierarchy 0
rotated_file_pages_by_system_under_hierarchy 0
freed_pages_by_system_under_hierarchy 0
freed_anon_pages_by_system_under_hierarchy 0
freed_file_pages_by_system_under_hierarchy 0
elapsed_ns_by_system_under_hierarchy 0

The *_under_hierarchy entries are for hierarchy management. This will be
useful for further memcg development and needs to be developed before we do
some complicated rework of LRU/softlimit management.

This patch adds a new struct memcg_scanrecord to the scan_control struct.
sc->nr_scanned et al. are not designed for exporting information.
For example, nr_scanned is reset frequently and is incremented by 2 when
scanning mapped pages. To avoid complexity, I added a new parameter to
scan_control which is dedicated to exporting the scanning score.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Changelog:
  - fixed the trigger for recording nr_freed in shrink_inactive_list()
Changelog:
  - renamed as vmscan_stat
  - handle file/anon
  - added "rotated"
  - changed names of params in vmscan_stat
---
 Documentation/cgroups/memory.txt |   85 +++++++++++++++++++
 include/linux/memcontrol.h       |   19 ++++
 include/linux/swap.h             |    6 -
 mm/memcontrol.c                  |  172 +++++++++++++++++++++++++++++++++++++--
 mm/vmscan.c                      |   39 +++++++-
 5 files changed, 303 insertions(+), 18 deletions(-)

Index: mmotm-0710/Documentation/cgroups/memory.txt
===================================================================
--- mmotm-0710.orig/Documentation/cgroups/memory.txt
+++ mmotm-0710/Documentation/cgroups/memory.txt
@@ -380,7 +380,7 @@ will be charged as a new owner of it.

 5.2 stat file

-memory.stat file includes following statistics
+5.2.1 memory.stat file includes following statistics

 # per-memory cgroup local status
 cache          - # of bytes of page cache memory.
@@ -438,6 +438,89 @@ Note:
         file_mapped is accounted only when the memory cgroup is owner of page
         cache.)

+5.2.2 memory.vmscan_stat
+
+memory.vmscan_stat includes statistics for memory scanning, freeing and
+reclaiming. The statistics show memory scanning information since the
+memory cgroup's creation and can be reset to 0 by writing 0 as
+
+       #echo 0 > ../memory.vmscan_stat
+
+This file contains the following statistics.
+
+[param]_[file_or_anon]_pages_by_[reason]_[under_hierarchy]
+[param]_elapsed_ns_by_[reason]_[under_hierarchy]
+
+For example,
+
+  scanned_file_pages_by_limit indicates the number of scanned
+  file pages at vmscan.
+
+Now, 3 parameters are supported
+
+  scanned - the number of pages scanned by vmscan
+  rotated - the number of pages activated at vmscan
+  freed   - the number of pages freed by vmscan
+
+If "rotated" is high against scanned/freed, the memcg seems busy.
+
+Now, 2 reasons are supported
+
+  limit  - the memory cgroup's limit
+  system - global memory pressure + softlimit
+           (global memory pressure not under softlimit is not handled now)
+
+When under_hierarchy is appended at the tail, the number indicates the
+total memcg scan of its children and itself.
+
+elapsed_ns is the elapsed time in nanoseconds. It may include sleep time
+and does not indicate CPU usage. So, please take it as just showing
+latency.
+
+Here is an example.
+
+# cat /cgroup/memory/A/memory.vmscan_stat
+scanned_pages_by_limit 9471864
+scanned_anon_pages_by_limit 6640629
+scanned_file_pages_by_limit 2831235
+rotated_pages_by_limit 4243974
+rotated_anon_pages_by_limit 3971968
+rotated_file_pages_by_limit 272006
+freed_pages_by_limit 2318492
+freed_anon_pages_by_limit 962052
+freed_file_pages_by_limit 1356440
+elapsed_ns_by_limit 351386416101
+scanned_pages_by_system 0
+scanned_anon_pages_by_system 0
+scanned_file_pages_by_system 0
+rotated_pages_by_system 0
+rotated_anon_pages_by_system 0
+rotated_file_pages_by_system 0
+freed_pages_by_system 0
+freed_anon_pages_by_system 0
+freed_file_pages_by_system 0
+elapsed_ns_by_system 0
+scanned_pages_by_limit_under_hierarchy 9471864
+scanned_anon_pages_by_limit_under_hierarchy 6640629
+scanned_file_pages_by_limit_under_hierarchy 2831235
+rotated_pages_by_limit_under_hierarchy 4243974
+rotated_anon_pages_by_limit_under_hierarchy 3971968
+rotated_file_pages_by_limit_under_hierarchy 272006
+freed_pages_by_limit_under_hierarchy 2318492
+freed_anon_pages_by_limit_under_hierarchy 962052
+freed_file_pages_by_limit_under_hierarchy 1356440
+elapsed_ns_by_limit_under_hierarchy 351386416101
+scanned_pages_by_system_under_hierarchy 0
+scanned_anon_pages_by_system_under_hierarchy 0
+scanned_file_pages_by_system_under_hierarchy 0
+rotated_pages_by_system_under_hierarchy 0
+rotated_anon_pages_by_system_under_hierarchy 0
+rotated_file_pages_by_system_under_hierarchy 0
+freed_pages_by_system_under_hierarchy 0
+freed_anon_pages_by_system_under_hierarchy 0
+freed_file_pages_by_system_under_hierarchy 0
+elapsed_ns_by_system_under_hierarchy 0
+
 5.3 swappiness

 Similar to /proc/sys/vm/swappiness, but affecting a hierarchy of groups only.

Index: mmotm-0710/include/linux/memcontrol.h
===================================================================
--- mmotm-0710.orig/include/linux/memcontrol.h
+++ mmotm-0710/include/linux/memcontrol.h
@@ -39,6 +39,16 @@ extern unsigned long mem_cgroup_isolate_
                                        struct mem_cgroup *mem_cont,
                                        int active, int file);

+struct memcg_scanrecord {
+       struct mem_cgroup *mem; /* scanned memory cgroup */
+       struct mem_cgroup *root; /* scan target hierarchy root */
+       int context; /* scanning context (see memcontrol.c) */
+       unsigned long nr_scanned[2]; /* the number of scanned pages */
+       unsigned long nr_rotated[2]; /* the number of rotated pages */
+       unsigned long nr_freed[2]; /* the number of freed pages */
+       unsigned long elapsed; /* nsec of time elapsed while scanning */
+};
+
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR
 /*
  * All "charge" functions with gfp_mask should use GFP_KERNEL or
@@ -117,6 +127,15 @@ mem_cgroup_get_reclaim_stat_from_page(st
 extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
                                        struct task_struct *p);

+extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem,
+                                                 gfp_t gfp_mask, bool noswap,
+                                                 struct memcg_scanrecord *rec);
+extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
+                                               gfp_t gfp_mask, bool noswap,
+                                               struct zone *zone,
+                                               struct memcg_scanrecord *rec,
+                                               unsigned long *nr_scanned);
+
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
 extern int do_swap_account;
 #endif

Index: mmotm-0710/include/linux/swap.h
===================================================================
--- mmotm-0710.orig/include/linux/swap.h
+++ mmotm-0710/include/linux/swap.h
@@ -253,12 +253,6 @@ static inline void lru_cache_add_file(st

 /* linux/mm/vmscan.c */
 extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
                                        gfp_t gfp_mask, nodemask_t *mask);
-extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem,
-                                                 gfp_t gfp_mask, bool noswap);
-extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
-                                               gfp_t gfp_mask, bool noswap,
-                                               struct zone *zone,
-                                               unsigned long *nr_scanned);
 extern int __isolate_lru_page(struct page *page, int mode, int file);
 extern unsigned long shrink_all_memory(unsigned long nr_pages);
 extern int vm_swappiness;

Index: mmotm-0710/mm/memcontrol.c
===================================================================
--- mmotm-0710.orig/mm/memcontrol.c
+++ mmotm-0710/mm/memcontrol.c
@@ -204,6 +204,50 @@ struct mem_cgroup_eventfd_list {
 static void mem_cgroup_threshold(struct mem_cgroup *mem);
 static void mem_cgroup_oom_notify(struct mem_cgroup *mem);

+enum {
+       SCAN_BY_LIMIT,
+       SCAN_BY_SYSTEM,
+       NR_SCAN_CONTEXT,
+       SCAN_BY_SHRINK, /* not recorded now */
+};
+
+enum {
+       SCAN,
+       SCAN_ANON,
+       SCAN_FILE,
+       ROTATE,
+       ROTATE_ANON,
+       ROTATE_FILE,
+       FREED,
+       FREED_ANON,
+       FREED_FILE,
+       ELAPSED,
+       NR_SCANSTATS,
+};
+
+struct scanstat {
+       spinlock_t lock;
+       unsigned long stats[NR_SCAN_CONTEXT][NR_SCANSTATS];
+       unsigned long rootstats[NR_SCAN_CONTEXT][NR_SCANSTATS];
+};
+
+const char *scanstat_string[NR_SCANSTATS] = {
+       "scanned_pages",
+       "scanned_anon_pages",
+       "scanned_file_pages",
+       "rotated_pages",
+       "rotated_anon_pages",
+       "rotated_file_pages",
+       "freed_pages",
+       "freed_anon_pages",
+       "freed_file_pages",
+       "elapsed_ns",
+};
+#define SCANSTAT_WORD_LIMIT    "_by_limit"
+#define SCANSTAT_WORD_SYSTEM   "_by_system"
+#define SCANSTAT_WORD_HIERARCHY        "_under_hierarchy"
+
+
 /*
  * The memory controller data structure. The memory controller controls both
  * page cache and RSS per cgroup. We would eventually like to provide
@@ -266,7 +310,8 @@ struct mem_cgroup {

        /* For oom notifier event fd */
        struct list_head oom_notify;
-
+       /* For recording LRU-scan statistics */
+       struct scanstat scanstat;
        /*
         * Should we move charges of a task when a task is moved into this
         * mem_cgroup ? And what type of charges should we move ?
@@ -1619,6 +1664,44 @@ bool mem_cgroup_reclaimable(struct mem_c
 }
 #endif

+static void __mem_cgroup_record_scanstat(unsigned long *stats,
+                                        struct memcg_scanrecord *rec)
+{
+
+       stats[SCAN] += rec->nr_scanned[0] + rec->nr_scanned[1];
+       stats[SCAN_ANON] += rec->nr_scanned[0];
+       stats[SCAN_FILE] += rec->nr_scanned[1];
+
+       stats[ROTATE] += rec->nr_rotated[0] + rec->nr_rotated[1];
+       stats[ROTATE_ANON] += rec->nr_rotated[0];
+       stats[ROTATE_FILE] += rec->nr_rotated[1];
+
+       stats[FREED] += rec->nr_freed[0] + rec->nr_freed[1];
+       stats[FREED_ANON] += rec->nr_freed[0];
+       stats[FREED_FILE] += rec->nr_freed[1];
+
+       stats[ELAPSED] += rec->elapsed;
+}
+
+static void mem_cgroup_record_scanstat(struct memcg_scanrecord *rec)
+{
+       struct mem_cgroup *mem;
+       int context = rec->context;
+
+       if (context >= NR_SCAN_CONTEXT)
+               return;
+
+       mem = rec->mem;
+       spin_lock(&mem->scanstat.lock);
+       __mem_cgroup_record_scanstat(mem->scanstat.stats[context], rec);
+       spin_unlock(&mem->scanstat.lock);
+
+       mem = rec->root;
+       spin_lock(&mem->scanstat.lock);
+       __mem_cgroup_record_scanstat(mem->scanstat.rootstats[context], rec);
+       spin_unlock(&mem->scanstat.lock);
+}
+
 /*
  * Scan the hierarchy if needed to reclaim memory. We remember the last child
  * we reclaimed from, so that we don't end up penalizing one child extensively
@@ -1643,8 +1726,9 @@ static int mem_cgroup_hierarchical_recla
        bool noswap = reclaim_options & MEM_CGROUP_RECLAIM_NOSWAP;
        bool shrink = reclaim_options & MEM_CGROUP_RECLAIM_SHRINK;
        bool check_soft = reclaim_options & MEM_CGROUP_RECLAIM_SOFT;
+       struct memcg_scanrecord rec;
        unsigned long excess;
-       unsigned long nr_scanned;
+       unsigned long scanned;

        excess = res_counter_soft_limit_excess(&root_mem->res) >> PAGE_SHIFT;

@@ -1652,6 +1736,15 @@ static int mem_cgroup_hierarchical_recla
        if (!check_soft && root_mem->memsw_is_minimum)
                noswap = true;

+       if (shrink)
+               rec.context = SCAN_BY_SHRINK;
+       else if (check_soft)
+               rec.context = SCAN_BY_SYSTEM;
+       else
+               rec.context = SCAN_BY_LIMIT;
+
+       rec.root = root_mem;
+
        while (1) {
                victim = mem_cgroup_select_victim(root_mem);
                if (victim == root_mem) {
@@ -1692,14 +1785,23 @@ static int mem_cgroup_hierarchical_recla
                        css_put(&victim->css);
                        continue;
                }
+               rec.mem = victim;
+               rec.nr_scanned[0] = 0;
+               rec.nr_scanned[1] = 0;
+               rec.nr_rotated[0] = 0;
+               rec.nr_rotated[1] = 0;
+               rec.nr_freed[0] = 0;
+               rec.nr_freed[1] = 0;
+               rec.elapsed = 0;
                /* we use swappiness of local cgroup */
                if (check_soft) {
                        ret = mem_cgroup_shrink_node_zone(victim, gfp_mask,
-                               noswap, zone, &nr_scanned);
-                       *total_scanned += nr_scanned;
+                               noswap, zone, &rec, &scanned);
+                       *total_scanned += scanned;
                } else
                        ret = try_to_free_mem_cgroup_pages(victim, gfp_mask,
-                                               noswap);
+                                               noswap, &rec);
+               mem_cgroup_record_scanstat(&rec);
                css_put(&victim->css);
                /*
                 * At shrinking usage, we can't check we should stop here or
@@ -3688,14 +3790,18 @@ try_to_free:
        /* try to free all pages in this cgroup */
        shrink = 1;
        while (nr_retries && mem->res.usage > 0) {
+               struct memcg_scanrecord rec;
                int progress;

                if (signal_pending(current)) {
                        ret = -EINTR;
                        goto out;
                }
+               rec.context = SCAN_BY_SHRINK;
+               rec.mem = mem;
+               rec.root = mem;
                progress = try_to_free_mem_cgroup_pages(mem, GFP_KERNEL,
-                                               false);
+                                               false, &rec);
                if (!progress) {
                        nr_retries--;
                        /* maybe some writeback is necessary */
@@ -4539,6 +4645,54 @@ static int mem_control_numa_stat_open(st
 }
 #endif /* CONFIG_NUMA */

+static int mem_cgroup_vmscan_stat_read(struct cgroup *cgrp,
+                               struct cftype *cft,
+                               struct cgroup_map_cb *cb)
+{
+       struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp);
+       char string[64];
+       int i;
+
+       for (i = 0; i < NR_SCANSTATS; i++) {
+               strcpy(string, scanstat_string[i]);
+               strcat(string, SCANSTAT_WORD_LIMIT);
+               cb->fill(cb, string, mem->scanstat.stats[SCAN_BY_LIMIT][i]);
+       }
+
+       for (i = 0; i < NR_SCANSTATS; i++) {
+               strcpy(string, scanstat_string[i]);
+               strcat(string, SCANSTAT_WORD_SYSTEM);
+               cb->fill(cb, string, mem->scanstat.stats[SCAN_BY_SYSTEM][i]);
+       }
+
+       for (i = 0; i < NR_SCANSTATS; i++) {
+               strcpy(string, scanstat_string[i]);
+               strcat(string, SCANSTAT_WORD_LIMIT);
+               strcat(string, SCANSTAT_WORD_HIERARCHY);
+               cb->fill(cb, string, mem->scanstat.rootstats[SCAN_BY_LIMIT][i]);
+       }
+       for (i = 0; i < NR_SCANSTATS; i++) {
+               strcpy(string, scanstat_string[i]);
+               strcat(string, SCANSTAT_WORD_SYSTEM);
+               strcat(string, SCANSTAT_WORD_HIERARCHY);
+               cb->fill(cb, string, mem->scanstat.rootstats[SCAN_BY_SYSTEM][i]);
+       }
+       return 0;
+}
+
+static int mem_cgroup_reset_vmscan_stat(struct cgroup *cgrp,
+                                       unsigned int event)
+{
+       struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp);
+
+       spin_lock(&mem->scanstat.lock);
+       memset(&mem->scanstat.stats, 0, sizeof(mem->scanstat.stats));
+       memset(&mem->scanstat.rootstats, 0, sizeof(mem->scanstat.rootstats));
+       spin_unlock(&mem->scanstat.lock);
+       return 0;
+}
+
+
 static struct cftype mem_cgroup_files[] = {
        {
                .name = "usage_in_bytes",
@@ -4609,6 +4763,11 @@ static struct cftype mem_cgroup_files[]
                .mode = S_IRUGO,
        },
 #endif
+       {
+               .name = "vmscan_stat",
+               .read_map = mem_cgroup_vmscan_stat_read,
+               .trigger = mem_cgroup_reset_vmscan_stat,
+       },
 };

 #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
@@ -4872,6 +5031,7 @@ mem_cgroup_create(struct cgroup_subsys *
        atomic_set(&mem->refcnt, 1);
        mem->move_charge_at_immigrate = 0;
        mutex_init(&mem->thresholds_lock);
+       spin_lock_init(&mem->scanstat.lock);
        return &mem->css;
 free_out:
        __mem_cgroup_free(mem);

Index: mmotm-0710/mm/vmscan.c
===================================================================
--- mmotm-0710.orig/mm/vmscan.c
+++ mmotm-0710/mm/vmscan.c
@@ -105,6 +105,7 @@ struct scan_control {

        /* Which cgroup do we reclaim from */
        struct mem_cgroup *mem_cgroup;
+       struct memcg_scanrecord *memcg_record;

        /*
         * Nodemask of nodes allowed by the caller. If NULL, all nodes
@@ -1307,6 +1308,8 @@ putback_lru_pages(struct zone *zone, str
                        int file = is_file_lru(lru);
                        int numpages = hpage_nr_pages(page);
                        reclaim_stat->recent_rotated[file] += numpages;
+                       if (!scanning_global_lru(sc))
+                               sc->memcg_record->nr_rotated[file] += numpages;
                }
                if (!pagevec_add(&pvec, page)) {
                        spin_unlock_irq(&zone->lru_lock);
@@ -1350,6 +1353,10 @@ static noinline_for_stack void update_is

        reclaim_stat->recent_scanned[0] += *nr_anon;
        reclaim_stat->recent_scanned[1] += *nr_file;
+
+       if (!scanning_global_lru(sc)) {
+               sc->memcg_record->nr_scanned[0] += *nr_anon;
+               sc->memcg_record->nr_scanned[1] += *nr_file;
+       }
 }

 /*
@@ -1463,6 +1470,9 @@ shrink_inactive_list(unsigned long nr_to
                nr_reclaimed += shrink_page_list(&page_list, zone, sc);
        }

+       if (!scanning_global_lru(sc))
+               sc->memcg_record->nr_freed[file] += nr_reclaimed;
+
        local_irq_disable();
        if (current_is_kswapd())
                __count_vm_events(KSWAPD_STEAL, nr_reclaimed);
@@ -1562,6 +1572,8 @@ static void shrink_active_list(unsigned
        }

        reclaim_stat->recent_scanned[file] += nr_taken;
+       if (!scanning_global_lru(sc))
+               sc->memcg_record->nr_scanned[file] += nr_taken;

        __count_zone_vm_events(PGREFILL, zone, pgscanned);
        if (file)
@@ -1613,6 +1625,8 @@ static void shrink_active_list(unsigned
         * get_scan_ratio.
         */
        reclaim_stat->recent_rotated[file] += nr_rotated;
+       if (!scanning_global_lru(sc))
+               sc->memcg_record->nr_rotated[file] += nr_rotated;

        move_active_pages_to_lru(zone, &l_active, LRU_ACTIVE + file * LRU_FILE);
@@ -2207,9 +2229,10 @@ unsigned long try_to_free_pages(struct z

 #ifdef CONFIG_CGROUP_MEM_RES_CTLR

 unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
-                                               gfp_t gfp_mask, bool noswap,
-                                               struct zone *zone,
-                                               unsigned long *nr_scanned)
+                                       gfp_t gfp_mask, bool noswap,
+                                       struct zone *zone,
+                                       struct memcg_scanrecord *rec,
+                                       unsigned long *scanned)
 {
        struct scan_control sc = {
                .nr_scanned = 0,
@@ -2219,7 +2242,9 @@ unsigned long mem_cgroup_shrink_node_zon
                .may_swap = !noswap,
                .order = 0,
                .mem_cgroup = mem,
+               .memcg_record = rec,
        };
+       unsigned long start, end;

        sc.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
                        (GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK);
@@ -2228,6 +2253,7 @@ unsigned long mem_cgroup_shrink_node_zon
                                                      sc.may_writepage,
                                                      sc.gfp_mask);

+       start = sched_clock();
        /*
         * NOTE: Although we can get the priority field, using it
         * here is not a good idea, since it limits the pages we can scan.
@@ -2236,19 +2262,25 @@ unsigned long mem_cgroup_shrink_node_zon
         * the priority and make it zero.
         */
        shrink_zone(0, zone, &sc);
+       end = sched_clock();
+
+       if (rec)
+               rec->elapsed += end - start;
+       *scanned = sc.nr_scanned;

        trace_mm_vmscan_memcg_softlimit_reclaim_end(sc.nr_reclaimed);

-       *nr_scanned = sc.nr_scanned;
        return sc.nr_reclaimed;
 }

 unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
                                           gfp_t gfp_mask,
-                                          bool noswap)
+                                          bool noswap,
+                                          struct memcg_scanrecord *rec)
 {
        struct zonelist *zonelist;
        unsigned long nr_reclaimed;
+       unsigned long start, end;
        int nid;
        struct scan_control sc = {
                .may_writepage = !laptop_mode,
@@ -2257,6 +2289,7 @@ unsigned long try_to_free_mem_cgroup_pag
                .nr_to_reclaim = SWAP_CLUSTER_MAX,
                .order = 0,
                .mem_cgroup = mem_cont,
+               .memcg_record = rec,
                .nodemask = NULL, /* we don't care the placement */
                .gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
                                (GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK),
@@ -2265,6 +2298,7 @@ unsigned long try_to_free_mem_cgroup_pag
                .gfp_mask = sc.gfp_mask,
        };

+       start = sched_clock();
        /*
         * Unlike direct reclaim via alloc_pages(), memcg's reclaim doesn't
         * take care of from where we get pages. So the node where we start the
@@ -2279,6 +2313,9 @@ unsigned long try_to_free_mem_cgroup_pag
                                            sc.gfp_mask);

        nr_reclaimed = do_try_to_free_pages(zonelist, &sc, &shrink);
+       end = sched_clock();
+       if (rec)
+               rec->elapsed += end - start;

        trace_mm_vmscan_memcg_reclaim_end(nr_reclaimed);
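Since memory.vmscan_stat is plain "name value" text, userspace can consume it
with a few lines of script. The following is an illustrative sketch only, not
part of the patch: the helper names, the cgroup path, and the idea of a
"rotate ratio" (rotated/scanned, a rough busyness signal, per the
documentation above) are this example's own assumptions.

```python
# Hypothetical userspace consumer of the memory.vmscan_stat format
# shown in the patch above. Not part of the kernel change.

def parse_vmscan_stat(text):
    """Parse 'name value' lines into a dict of ints, skipping malformed lines."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats

def rotate_ratio(stats, reason="limit"):
    """Fraction of scanned pages that were rotated (re-activated) for a
    given reason ('limit' or 'system'). A high value suggests the memcg's
    working set is under pressure."""
    scanned = stats.get("scanned_pages_by_" + reason, 0)
    rotated = stats.get("rotated_pages_by_" + reason, 0)
    return rotated / scanned if scanned else 0.0

# Sample taken from the example output in the changelog; in real use this
# would come from open("/cgroup/memory/A/memory.vmscan_stat").read().
sample = """\
scanned_pages_by_limit 9471864
rotated_pages_by_limit 4243974
freed_pages_by_limit 2318492
elapsed_ns_by_limit 351386416101
"""

stats = parse_vmscan_stat(sample)
print(round(rotate_ratio(stats), 3))  # prints 0.448
```

Resetting the counters (`echo 0 > memory.vmscan_stat`) and sampling the file
around a workload gives per-interval figures rather than since-creation totals.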
false, &rec); if (!progress) { nr_retries--; /* maybe some writeback is necessary */ @@ -4539,6 +4645,54 @@ static int mem_control_numa_stat_open(st } #endif /* CONFIG_NUMA */ +static int mem_cgroup_vmscan_stat_read(struct cgroup *cgrp, + struct cftype *cft, + struct cgroup_map_cb *cb) +{ + struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp); + char string[64]; + int i; + + for (i = 0; i < NR_SCANSTATS; i++) { + strcpy(string, scanstat_string[i]); + strcat(string, SCANSTAT_WORD_LIMIT); + cb->fill(cb, string, mem->scanstat.stats[SCAN_BY_LIMIT][i]); + } + + for (i = 0; i < NR_SCANSTATS; i++) { + strcpy(string, scanstat_string[i]); + strcat(string, SCANSTAT_WORD_SYSTEM); + cb->fill(cb, string, mem->scanstat.stats[SCAN_BY_SYSTEM][i]); + } + + for (i = 0; i < NR_SCANSTATS; i++) { + strcpy(string, scanstat_string[i]); + strcat(string, SCANSTAT_WORD_LIMIT); + strcat(string, SCANSTAT_WORD_HIERARCHY); + cb->fill(cb, string, mem->scanstat.rootstats[SCAN_BY_LIMIT][i]); + } + for (i = 0; i < NR_SCANSTATS; i++) { + strcpy(string, scanstat_string[i]); + strcat(string, SCANSTAT_WORD_SYSTEM); + strcat(string, SCANSTAT_WORD_HIERARCHY); + cb->fill(cb, string, mem->scanstat.rootstats[SCAN_BY_SYSTEM][i]); + } + return 0; +} + +static int mem_cgroup_reset_vmscan_stat(struct cgroup *cgrp, + unsigned int event) +{ + struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp); + + spin_lock(&mem->scanstat.lock); + memset(&mem->scanstat.stats, 0, sizeof(mem->scanstat.stats)); + memset(&mem->scanstat.rootstats, 0, sizeof(mem->scanstat.rootstats)); + spin_unlock(&mem->scanstat.lock); + return 0; +} + + static struct cftype mem_cgroup_files[] = { { .name = "usage_in_bytes", @@ -4609,6 +4763,11 @@ static struct cftype mem_cgroup_files[] .mode = S_IRUGO, }, #endif + { + .name = "vmscan_stat", + .read_map = mem_cgroup_vmscan_stat_read, + .trigger = mem_cgroup_reset_vmscan_stat, + }, }; #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP @@ -4872,6 +5031,7 @@ mem_cgroup_create(struct cgroup_subsys * 
atomic_set(&mem->refcnt, 1); mem->move_charge_at_immigrate = 0; mutex_init(&mem->thresholds_lock); + spin_lock_init(&mem->scanstat.lock); return &mem->css; free_out: __mem_cgroup_free(mem); Index: mmotm-0710/mm/vmscan.c =================================================================== --- mmotm-0710.orig/mm/vmscan.c +++ mmotm-0710/mm/vmscan.c @@ -105,6 +105,7 @@ struct scan_control { /* Which cgroup do we reclaim from */ struct mem_cgroup *mem_cgroup; + struct memcg_scanrecord *memcg_record; /* * Nodemask of nodes allowed by the caller. If NULL, all nodes @@ -1307,6 +1308,8 @@ putback_lru_pages(struct zone *zone, str int file = is_file_lru(lru); int numpages = hpage_nr_pages(page); reclaim_stat->recent_rotated[file] += numpages; + if (!scanning_global_lru(sc)) + sc->memcg_record->nr_rotated[file] += numpages; } if (!pagevec_add(&pvec, page)) { spin_unlock_irq(&zone->lru_lock); @@ -1350,6 +1353,10 @@ static noinline_for_stack void update_is reclaim_stat->recent_scanned[0] += *nr_anon; reclaim_stat->recent_scanned[1] += *nr_file; + if (!scanning_global_lru(sc)) { + sc->memcg_record->nr_scanned[0] += *nr_anon; + sc->memcg_record->nr_scanned[1] += *nr_file; + } } /* @@ -1463,6 +1470,9 @@ shrink_inactive_list(unsigned long nr_to nr_reclaimed += shrink_page_list(&page_list, zone, sc); } + if (!scanning_global_lru(sc)) + sc->memcg_record->nr_freed[file] += nr_reclaimed; + local_irq_disable(); if (current_is_kswapd()) __count_vm_events(KSWAPD_STEAL, nr_reclaimed); @@ -1562,6 +1572,8 @@ static void shrink_active_list(unsigned } reclaim_stat->recent_scanned[file] += nr_taken; + if (!scanning_global_lru(sc)) + sc->memcg_record->nr_scanned[file] += nr_taken; __count_zone_vm_events(PGREFILL, zone, pgscanned); if (file) @@ -1613,6 +1625,8 @@ static void shrink_active_list(unsigned * get_scan_ratio. 
*/ reclaim_stat->recent_rotated[file] += nr_rotated; + if (!scanning_global_lru(sc)) + sc->memcg_record->nr_rotated[file] += nr_rotated; move_active_pages_to_lru(zone, &l_active, LRU_ACTIVE + file * LRU_FILE); @@ -2207,9 +2229,10 @@ unsigned long try_to_free_pages(struct z #ifdef CONFIG_CGROUP_MEM_RES_CTLR unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem, - gfp_t gfp_mask, bool noswap, - struct zone *zone, - unsigned long *nr_scanned) + gfp_t gfp_mask, bool noswap, + struct zone *zone, + struct memcg_scanrecord *rec, + unsigned long *scanned) { struct scan_control sc = { .nr_scanned = 0, @@ -2219,7 +2242,9 @@ unsigned long mem_cgroup_shrink_node_zon .may_swap = !noswap, .order = 0, .mem_cgroup = mem, + .memcg_record = rec, }; + unsigned long start, end; sc.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) | (GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK); @@ -2228,6 +2253,7 @@ unsigned long mem_cgroup_shrink_node_zon sc.may_writepage, sc.gfp_mask); + start = sched_clock(); /* * NOTE: Although we can get the priority field, using it * here is not a good idea, since it limits the pages we can scan. @@ -2236,19 +2262,25 @@ unsigned long mem_cgroup_shrink_node_zon * the priority and make it zero. 
*/ shrink_zone(0, zone, &sc); + end = sched_clock(); + + if (rec) + rec->elapsed += end - start; + *scanned = sc.nr_scanned; trace_mm_vmscan_memcg_softlimit_reclaim_end(sc.nr_reclaimed); - *nr_scanned = sc.nr_scanned; return sc.nr_reclaimed; } unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont, gfp_t gfp_mask, - bool noswap) + bool noswap, + struct memcg_scanrecord *rec) { struct zonelist *zonelist; unsigned long nr_reclaimed; + unsigned long start, end; int nid; struct scan_control sc = { .may_writepage = !laptop_mode, @@ -2257,6 +2289,7 @@ unsigned long try_to_free_mem_cgroup_pag .nr_to_reclaim = SWAP_CLUSTER_MAX, .order = 0, .mem_cgroup = mem_cont, + .memcg_record = rec, .nodemask = NULL, /* we don't care the placement */ .gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) | (GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK), @@ -2265,6 +2298,7 @@ unsigned long try_to_free_mem_cgroup_pag .gfp_mask = sc.gfp_mask, }; + start = sched_clock(); /* * Unlike direct reclaim via alloc_pages(), memcg's reclaim doesn't * take care of from where we get pages. So the node where we start the @@ -2279,6 +2313,9 @@ unsigned long try_to_free_mem_cgroup_pag sc.gfp_mask); nr_reclaimed = do_try_to_free_pages(zonelist, &sc, &shrink); + end = sched_clock(); + if (rec) + rec->elapsed += end - start; trace_mm_vmscan_memcg_reclaim_end(nr_reclaimed); -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH v3] memcg: add memory.vmscan_stat 2011-07-22 8:15 ` KAMEZAWA Hiroyuki @ 2011-08-08 12:43 ` Johannes Weiner 0 siblings, 0 replies; 54+ messages in thread From: Johannes Weiner @ 2011-08-08 12:43 UTC (permalink / raw) To: KAMEZAWA Hiroyuki Cc: linux-mm, linux-kernel, nishimura, Michal Hocko, akpm, abrestic On Fri, Jul 22, 2011 at 05:15:40PM +0900, KAMEZAWA Hiroyuki wrote: > [PATCH] add memory.vmscan_stat > > commit log of commit 0ae5e89 " memcg: count the soft_limit reclaim in..." > says it adds scanning stats to memory.stat file. But it doesn't because > we considered we needed to reach a consensus for such new APIs. > > This patch is a trial to add memory.scan_stat. This shows > - the number of scanned pages(total, anon, file) > - the number of rotated pages(total, anon, file) > - the number of freed pages(total, anon, file) > - the elapsed time (including sleep/pause time) > > for both of direct/soft reclaim. > > The biggest difference from Ying's original one is that this file > can be reset by some write, as > > # echo 0 ...../memory.scan_stat > > Example of output is here. This is a result after make -j 6 kernel > under 300M limit. 
> > [kamezawa@bluextal ~]$ cat /cgroup/memory/A/memory.scan_stat > [kamezawa@bluextal ~]$ cat /cgroup/memory/A/memory.vmscan_stat > scanned_pages_by_limit 9471864 > scanned_anon_pages_by_limit 6640629 > scanned_file_pages_by_limit 2831235 > rotated_pages_by_limit 4243974 > rotated_anon_pages_by_limit 3971968 > rotated_file_pages_by_limit 272006 > freed_pages_by_limit 2318492 > freed_anon_pages_by_limit 962052 > freed_file_pages_by_limit 1356440 > elapsed_ns_by_limit 351386416101 > scanned_pages_by_system 0 > scanned_anon_pages_by_system 0 > scanned_file_pages_by_system 0 > rotated_pages_by_system 0 > rotated_anon_pages_by_system 0 > rotated_file_pages_by_system 0 > freed_pages_by_system 0 > freed_anon_pages_by_system 0 > freed_file_pages_by_system 0 > elapsed_ns_by_system 0 > scanned_pages_by_limit_under_hierarchy 9471864 > scanned_anon_pages_by_limit_under_hierarchy 6640629 > scanned_file_pages_by_limit_under_hierarchy 2831235 > rotated_pages_by_limit_under_hierarchy 4243974 > rotated_anon_pages_by_limit_under_hierarchy 3971968 > rotated_file_pages_by_limit_under_hierarchy 272006 > freed_pages_by_limit_under_hierarchy 2318492 > freed_anon_pages_by_limit_under_hierarchy 962052 > freed_file_pages_by_limit_under_hierarchy 1356440 > elapsed_ns_by_limit_under_hierarchy 351386416101 > scanned_pages_by_system_under_hierarchy 0 > scanned_anon_pages_by_system_under_hierarchy 0 > scanned_file_pages_by_system_under_hierarchy 0 > rotated_pages_by_system_under_hierarchy 0 > rotated_anon_pages_by_system_under_hierarchy 0 > rotated_file_pages_by_system_under_hierarchy 0 > freed_pages_by_system_under_hierarchy 0 > freed_anon_pages_by_system_under_hierarchy 0 > freed_file_pages_by_system_under_hierarchy 0 > elapsed_ns_by_system_under_hierarchy 0 > > total_xxxx is for hierarchy management. > > This will be useful for further memcg developments and needs to be > developed before we do some complicated rework on LRU/softlimit > management. 
> > This patch adds a new struct memcg_scanrecord into scan_control struct. > sc->nr_scanned et al. are not designed for exporting information. For example, > nr_scanned is reset frequently and incremented +2 at scanning mapped pages. > > For avoiding complexity, I added a new param in scan_control which is for > exporting scanning score. > > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> > > Changelog: > - fixed the trigger for recording nr_freed in shrink_inactive_list() > Changelog: > - renamed as vmscan_stat > - handle file/anon > - added "rotated" > - changed names of param in vmscan_stat. > --- > Documentation/cgroups/memory.txt | 85 +++++++++++++++++++ > include/linux/memcontrol.h | 19 ++++ > include/linux/swap.h | 6 - > mm/memcontrol.c | 172 +++++++++++++++++++++++++++++++++++++-- > mm/vmscan.c | 39 +++++++- > 5 files changed, 303 insertions(+), 18 deletions(-) > > Index: mmotm-0710/Documentation/cgroups/memory.txt > =================================================================== > --- mmotm-0710.orig/Documentation/cgroups/memory.txt > +++ mmotm-0710/Documentation/cgroups/memory.txt > @@ -380,7 +380,7 @@ will be charged as a new owner of it. > > 5.2 stat file > > -memory.stat file includes following statistics > +5.2.1 memory.stat file includes following statistics > > # per-memory cgroup local status > cache - # of bytes of page cache memory. > @@ -438,6 +438,89 @@ Note: > file_mapped is accounted only when the memory cgroup is owner of page > cache.) > > +5.2.2 memory.vmscan_stat > + > +memory.vmscan_stat includes statistics information for memory scanning and > +freeing, reclaiming. The statistics show memory scanning information since > +memory cgroup creation and can be reset to 0 by writing 0 as > + > + #echo 0 > ../memory.vmscan_stat > + > +This file contains following statistics. 
> + > +[param]_[file_or_anon]_pages_by_[reason]_[under_hierarchy] > +[param]_elapsed_ns_by_[reason]_[under_hierarchy] > + > +For example, > + > + scanned_file_pages_by_limit indicates the number of scanned > + file pages at vmscan. > + > +Now, 3 parameters are supported > + > + scanned - the number of pages scanned by vmscan > + rotated - the number of pages activated at vmscan > + freed - the number of pages freed by vmscan > + > +If "rotated" is high against scanned/freed, the memcg seems busy. > + > +Now, 2 reasons are supported > + > + limit - the memory cgroup's limit > + system - global memory pressure + softlimit > + (global memory pressure not under softlimit is not handled now) > + > +When under_hierarchy is added in the tail, the number indicates the > +total memcg scan of its children and itself. In your implementation, statistics are only accounted to the memcg triggering the limit and the respectively scanned memcgs. Consider the following setup: A / \ B C / D If D tries to charge but hits the limit of A, then B's hierarchy counters do not reflect the reclaim activity resulting in D. That's not consistent with how hierarchy counters usually operate, and neither with how you documented it. On a non-technical note: as Ying Han and I were the other two people working on reclaim and statistics, it really irks me that neither of us were CCd on this. Especially on such a controversial change. ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH v3] memcg: add memory.vmscan_stat 2011-08-08 12:43 ` Johannes Weiner @ 2011-08-08 23:33 ` KAMEZAWA Hiroyuki 0 siblings, 0 replies; 54+ messages in thread From: KAMEZAWA Hiroyuki @ 2011-08-08 23:33 UTC (permalink / raw) To: Johannes Weiner Cc: linux-mm, linux-kernel, nishimura, Michal Hocko, akpm, abrestic On Mon, 8 Aug 2011 14:43:33 +0200 Johannes Weiner <jweiner@redhat.com> wrote: > On Fri, Jul 22, 2011 at 05:15:40PM +0900, KAMEZAWA Hiroyuki wrote: > > [PATCH] add memory.vmscan_stat > > > > commit log of commit 0ae5e89 " memcg: count the soft_limit reclaim in..." > > says it adds scanning stats to memory.stat file. But it doesn't because > > we considered we needed to reach a consensus for such new APIs. > > > > This patch is a trial to add memory.scan_stat. This shows > > - the number of scanned pages(total, anon, file) > > - the number of rotated pages(total, anon, file) > > - the number of freed pages(total, anon, file) > > - the elapsed time (including sleep/pause time) > > > > for both of direct/soft reclaim. > > > > The biggest difference from Ying's original one is that this file > > can be reset by some write, as > > > > # echo 0 ...../memory.scan_stat > > > > Example of output is here. This is a result after make -j 6 kernel > > under 300M limit. 
> > > > [kamezawa@bluextal ~]$ cat /cgroup/memory/A/memory.scan_stat > > [kamezawa@bluextal ~]$ cat /cgroup/memory/A/memory.vmscan_stat > > scanned_pages_by_limit 9471864 > > scanned_anon_pages_by_limit 6640629 > > scanned_file_pages_by_limit 2831235 > > rotated_pages_by_limit 4243974 > > rotated_anon_pages_by_limit 3971968 > > rotated_file_pages_by_limit 272006 > > freed_pages_by_limit 2318492 > > freed_anon_pages_by_limit 962052 > > freed_file_pages_by_limit 1356440 > > elapsed_ns_by_limit 351386416101 > > scanned_pages_by_system 0 > > scanned_anon_pages_by_system 0 > > scanned_file_pages_by_system 0 > > rotated_pages_by_system 0 > > rotated_anon_pages_by_system 0 > > rotated_file_pages_by_system 0 > > freed_pages_by_system 0 > > freed_anon_pages_by_system 0 > > freed_file_pages_by_system 0 > > elapsed_ns_by_system 0 > > scanned_pages_by_limit_under_hierarchy 9471864 > > scanned_anon_pages_by_limit_under_hierarchy 6640629 > > scanned_file_pages_by_limit_under_hierarchy 2831235 > > rotated_pages_by_limit_under_hierarchy 4243974 > > rotated_anon_pages_by_limit_under_hierarchy 3971968 > > rotated_file_pages_by_limit_under_hierarchy 272006 > > freed_pages_by_limit_under_hierarchy 2318492 > > freed_anon_pages_by_limit_under_hierarchy 962052 > > freed_file_pages_by_limit_under_hierarchy 1356440 > > elapsed_ns_by_limit_under_hierarchy 351386416101 > > scanned_pages_by_system_under_hierarchy 0 > > scanned_anon_pages_by_system_under_hierarchy 0 > > scanned_file_pages_by_system_under_hierarchy 0 > > rotated_pages_by_system_under_hierarchy 0 > > rotated_anon_pages_by_system_under_hierarchy 0 > > rotated_file_pages_by_system_under_hierarchy 0 > > freed_pages_by_system_under_hierarchy 0 > > freed_anon_pages_by_system_under_hierarchy 0 > > freed_file_pages_by_system_under_hierarchy 0 > > elapsed_ns_by_system_under_hierarchy 0 > > > > total_xxxx is for hierarchy management. 
> > > > This will be useful for further memcg developments and needs to be > > developed before we do some complicated rework on LRU/softlimit > > management. > > > > This patch adds a new struct memcg_scanrecord into scan_control struct. > > sc->nr_scanned et al. are not designed for exporting information. For example, > > nr_scanned is reset frequently and incremented +2 at scanning mapped pages. > > > > For avoiding complexity, I added a new param in scan_control which is for > > exporting scanning score. > > > > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> > > > > Changelog: > > - fixed the trigger for recording nr_freed in shrink_inactive_list() > > Changelog: > > - renamed as vmscan_stat > > - handle file/anon > > - added "rotated" > > - changed names of param in vmscan_stat. > > --- > > Documentation/cgroups/memory.txt | 85 +++++++++++++++++++ > > include/linux/memcontrol.h | 19 ++++ > > include/linux/swap.h | 6 - > > mm/memcontrol.c | 172 +++++++++++++++++++++++++++++++++++++-- > > mm/vmscan.c | 39 +++++++- > > 5 files changed, 303 insertions(+), 18 deletions(-) > > > > Index: mmotm-0710/Documentation/cgroups/memory.txt > > =================================================================== > > --- mmotm-0710.orig/Documentation/cgroups/memory.txt > > +++ mmotm-0710/Documentation/cgroups/memory.txt > > @@ -380,7 +380,7 @@ will be charged as a new owner of it. > > > > 5.2 stat file > > > > -memory.stat file includes following statistics > > +5.2.1 memory.stat file includes following statistics > > > > # per-memory cgroup local status > > cache - # of bytes of page cache memory. > > @@ -438,6 +438,89 @@ Note: > > file_mapped is accounted only when the memory cgroup is owner of page > > cache.) > > > > +5.2.2 memory.vmscan_stat > > + > > +memory.vmscan_stat includes statistics information for memory scanning and > > +freeing, reclaiming. 
The statistics show memory scanning information since > > +memory cgroup creation and can be reset to 0 by writing 0 as > > + > > + #echo 0 > ../memory.vmscan_stat > > + > > +This file contains following statistics. > > + > > +[param]_[file_or_anon]_pages_by_[reason]_[under_hierarchy] > > +[param]_elapsed_ns_by_[reason]_[under_hierarchy] > > + > > +For example, > > + > > + scanned_file_pages_by_limit indicates the number of scanned > > + file pages at vmscan. > > + > > +Now, 3 parameters are supported > > + > > + scanned - the number of pages scanned by vmscan > > + rotated - the number of pages activated at vmscan > > + freed - the number of pages freed by vmscan > > + > > +If "rotated" is high against scanned/freed, the memcg seems busy. > > + > > +Now, 2 reasons are supported > > + > > + limit - the memory cgroup's limit > > + system - global memory pressure + softlimit > > + (global memory pressure not under softlimit is not handled now) > > + > > +When under_hierarchy is added in the tail, the number indicates the > > +total memcg scan of its children and itself. > > In your implementation, statistics are only accounted to the memcg > triggering the limit and the respectively scanned memcgs. > > Consider the following setup: > > A > / \ > B C > / > D > > If D tries to charge but hits the limit of A, then B's hierarchy > counters do not reflect the reclaim activity resulting in D. > yes, as I expected. > That's not consistent with how hierarchy counters usually operate, and > neither with how you documented it. > Hmm. > On a non-technical note: as Ying Han and I were the other two people > working on reclaim and statistics, it really irks me that neither of > us were CCd on this. Especially on such a controversial change. I always drop CC if no reply/review comes. Thanks, -Kame ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH v3] memcg: add memory.vmscan_stat @ 2011-08-08 23:33 ` KAMEZAWA Hiroyuki 0 siblings, 0 replies; 54+ messages in thread From: KAMEZAWA Hiroyuki @ 2011-08-08 23:33 UTC (permalink / raw) To: Johannes Weiner Cc: linux-mm, linux-kernel, nishimura, Michal Hocko, akpm, abrestic On Mon, 8 Aug 2011 14:43:33 +0200 Johannes Weiner <jweiner@redhat.com> wrote: > On Fri, Jul 22, 2011 at 05:15:40PM +0900, KAMEZAWA Hiroyuki wrote: > > [PATCH] add memory.vmscan_stat > > > > commit log of commit 0ae5e89 " memcg: count the soft_limit reclaim in..." > > says it adds scanning stats to memory.stat file. But it doesn't because > > we considered we needed to make a concensus for such new APIs. > > > > This patch is a trial to add memory.scan_stat. This shows > > - the number of scanned pages(total, anon, file) > > - the number of rotated pages(total, anon, file) > > - the number of freed pages(total, anon, file) > > - the number of elaplsed time (including sleep/pause time) > > > > for both of direct/soft reclaim. > > > > The biggest difference with oringinal Ying's one is that this file > > can be reset by some write, as > > > > # echo 0 ...../memory.scan_stat > > > > Example of output is here. This is a result after make -j 6 kernel > > under 300M limit. 
> > > > [kamezawa@bluextal ~]$ cat /cgroup/memory/A/memory.scan_stat > > [kamezawa@bluextal ~]$ cat /cgroup/memory/A/memory.vmscan_stat > > scanned_pages_by_limit 9471864 > > scanned_anon_pages_by_limit 6640629 > > scanned_file_pages_by_limit 2831235 > > rotated_pages_by_limit 4243974 > > rotated_anon_pages_by_limit 3971968 > > rotated_file_pages_by_limit 272006 > > freed_pages_by_limit 2318492 > > freed_anon_pages_by_limit 962052 > > freed_file_pages_by_limit 1356440 > > elapsed_ns_by_limit 351386416101 > > scanned_pages_by_system 0 > > scanned_anon_pages_by_system 0 > > scanned_file_pages_by_system 0 > > rotated_pages_by_system 0 > > rotated_anon_pages_by_system 0 > > rotated_file_pages_by_system 0 > > freed_pages_by_system 0 > > freed_anon_pages_by_system 0 > > freed_file_pages_by_system 0 > > elapsed_ns_by_system 0 > > scanned_pages_by_limit_under_hierarchy 9471864 > > scanned_anon_pages_by_limit_under_hierarchy 6640629 > > scanned_file_pages_by_limit_under_hierarchy 2831235 > > rotated_pages_by_limit_under_hierarchy 4243974 > > rotated_anon_pages_by_limit_under_hierarchy 3971968 > > rotated_file_pages_by_limit_under_hierarchy 272006 > > freed_pages_by_limit_under_hierarchy 2318492 > > freed_anon_pages_by_limit_under_hierarchy 962052 > > freed_file_pages_by_limit_under_hierarchy 1356440 > > elapsed_ns_by_limit_under_hierarchy 351386416101 > > scanned_pages_by_system_under_hierarchy 0 > > scanned_anon_pages_by_system_under_hierarchy 0 > > scanned_file_pages_by_system_under_hierarchy 0 > > rotated_pages_by_system_under_hierarchy 0 > > rotated_anon_pages_by_system_under_hierarchy 0 > > rotated_file_pages_by_system_under_hierarchy 0 > > freed_pages_by_system_under_hierarchy 0 > > freed_anon_pages_by_system_under_hierarchy 0 > > freed_file_pages_by_system_under_hierarchy 0 > > elapsed_ns_by_system_under_hierarchy 0 > > > > total_xxxx is for hierarchy management. 
> > > > This will be useful for further memcg development and needs to be > > developed before we do some complicated rework on LRU/softlimit > > management. > > > > This patch adds a new struct memcg_scanrecord to the scan_control struct. > > sc->nr_scanned et al. are not designed for exporting information. For example, > > nr_scanned is reset frequently and incremented by +2 when scanning mapped pages. > > > > To avoid complexity, I added a new param in scan_control which is for > > exporting scanning scores. > > > > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> > > > > Changelog: > > - fixed the trigger for recording nr_freed in shrink_inactive_list() > > Changelog: > > - renamed as vmscan_stat > > - handle file/anon > > - added "rotated" > > - changed names of param in vmscan_stat. > > --- > > Documentation/cgroups/memory.txt | 85 +++++++++++++++++++ > > include/linux/memcontrol.h | 19 ++++ > > include/linux/swap.h | 6 - > > mm/memcontrol.c | 172 +++++++++++++++++++++++++++++++++++++-- > > mm/vmscan.c | 39 +++++++- > > 5 files changed, 303 insertions(+), 18 deletions(-) > > > > Index: mmotm-0710/Documentation/cgroups/memory.txt > > =================================================================== > > --- mmotm-0710.orig/Documentation/cgroups/memory.txt > > +++ mmotm-0710/Documentation/cgroups/memory.txt > > @@ -380,7 +380,7 @@ will be charged as a new owner of it. > > > > 5.2 stat file > > > > -memory.stat file includes following statistics > > +5.2.1 memory.stat file includes following statistics > > > > # per-memory cgroup local status > > cache - # of bytes of page cache memory. > > @@ -438,6 +438,89 @@ Note: > > file_mapped is accounted only when the memory cgroup is owner of page > > cache.) > > > > +5.2.2 memory.vmscan_stat > > + > > +memory.vmscan_stat includes statistics information for memory scanning and > > +freeing, reclaiming. 
The statistics shows memory scanning information since > > +memory cgroup creation and can be reset to 0 by writing 0 as > > + > > + #echo 0 > ../memory.vmscan_stat > > + > > +This file contains following statistics. > > + > > +[param]_[file_or_anon]_pages_by_[reason]_[under_heararchy] > > +[param]_elapsed_ns_by_[reason]_[under_hierarchy] > > + > > +For example, > > + > > + scanned_file_pages_by_limit indicates the number of scanned > > + file pages at vmscan. > > + > > +Now, 3 parameters are supported > > + > > + scanned - the number of pages scanned by vmscan > > + rotated - the number of pages activated at vmscan > > + freed - the number of pages freed by vmscan > > + > > +If "rotated" is high against scanned/freed, the memcg seems busy. > > + > > +Now, 2 reason are supported > > + > > + limit - the memory cgroup's limit > > + system - global memory pressure + softlimit > > + (global memory pressure not under softlimit is not handled now) > > + > > +When under_hierarchy is added in the tail, the number indicates the > > +total memcg scan of its children and itself. > > In your implementation, statistics are only accounted to the memcg > triggering the limit and the respectively scanned memcgs. > > Consider the following setup: > > A > / \ > B C > / > D > > If D tries to charge but hits the limit of A, then B's hierarchy > counters do not reflect the reclaim activity resulting in D. > yes, as I expected. > That's not consistent with how hierarchy counters usually operate, and > neither with how you documented it. > Hmm. > On a non-technical note: as Ying Han and I were the other two people > working on reclaim and statistics, it really irks me that neither of > us were CCd on this. Especially on such a controversial change. I always drop CC if no reply/review comes. Thanks, -Kame -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
* Re: [PATCH v3] memcg: add memory.vmscan_stat 2011-08-08 23:33 ` KAMEZAWA Hiroyuki @ 2011-08-09 8:01 ` Johannes Weiner -1 siblings, 0 replies; 54+ messages in thread From: Johannes Weiner @ 2011-08-09 8:01 UTC (permalink / raw) To: KAMEZAWA Hiroyuki Cc: linux-mm, linux-kernel, nishimura, Michal Hocko, akpm, abrestic On Tue, Aug 09, 2011 at 08:33:45AM +0900, KAMEZAWA Hiroyuki wrote: > On Mon, 8 Aug 2011 14:43:33 +0200 > Johannes Weiner <jweiner@redhat.com> wrote: > > On a non-technical note: as Ying Han and I were the other two people > > working on reclaim and statistics, it really irks me that neither of > > us were CCd on this. Especially on such a controversial change. > > I always drop CC if no reply/review comes. There is always the possibility that a single mail in an otherwise unrelated patch series is overlooked (especially while on vacation ;). Getting CCd on revisions and -mm inclusion is a really nice reminder. Unless there is a really good reason not to (is there ever?), could you please keep CCs?
* Re: [PATCH v3] memcg: add memory.vmscan_stat 2011-08-09 8:01 ` Johannes Weiner @ 2011-08-09 8:01 ` KAMEZAWA Hiroyuki -1 siblings, 0 replies; 54+ messages in thread From: KAMEZAWA Hiroyuki @ 2011-08-09 8:01 UTC (permalink / raw) To: Johannes Weiner Cc: linux-mm, linux-kernel, nishimura, Michal Hocko, akpm, abrestic On Tue, 9 Aug 2011 10:01:59 +0200 Johannes Weiner <jweiner@redhat.com> wrote: > On Tue, Aug 09, 2011 at 08:33:45AM +0900, KAMEZAWA Hiroyuki wrote: > > On Mon, 8 Aug 2011 14:43:33 +0200 > > Johannes Weiner <jweiner@redhat.com> wrote: > > > On a non-technical note: as Ying Han and I were the other two people > > > working on reclaim and statistics, it really irks me that neither of > > > us were CCd on this. Especially on such a controversial change. > > > > I always drop CC if no reply/review comes. > > There is always the possibility that a single mail in an otherwise > unrelated patch series is overlooked (especially while on vacation ;). > Getting CCd on revisions and -mm inclusion is a really nice reminder. > > Unless there is a really good reason not to (is there ever?), could > you please keep CCs? > Ok, if you want, I'll CC always. I myself just don't like to get 3 copies of mails when I don't have much interest ;) Thanks, -Kame
* Re: [PATCH v3] memcg: add memory.vmscan_stat 2011-08-09 8:01 ` KAMEZAWA Hiroyuki @ 2011-08-13 1:04 ` Ying Han -1 siblings, 0 replies; 54+ messages in thread From: Ying Han @ 2011-08-13 1:04 UTC (permalink / raw) To: KAMEZAWA Hiroyuki Cc: Johannes Weiner, linux-mm, linux-kernel, nishimura, Michal Hocko, akpm, abrestic On Tue, Aug 9, 2011 at 1:01 AM, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote: > > On Tue, 9 Aug 2011 10:01:59 +0200 > Johannes Weiner <jweiner@redhat.com> wrote: > > > On Tue, Aug 09, 2011 at 08:33:45AM +0900, KAMEZAWA Hiroyuki wrote: > > > On Mon, 8 Aug 2011 14:43:33 +0200 > > > Johannes Weiner <jweiner@redhat.com> wrote: > > > > On a non-technical note: as Ying Han and I were the other two people > > > > working on reclaim and statistics, it really irks me that neither of > > > > us were CCd on this. Especially on such a controversial change. > > > > > > I always drop CC if no reply/review comes. > > > > There is always the possibility that a single mail in an otherwise > > unrelated patch series is overlooked (especially while on vacation ;). > > Getting CCd on revisions and -mm inclusion is a really nice reminder. > > > > Unless there is a really good reason not to (is there ever?), could > > you please keep CCs? > > > > Ok, if you want, I'll CC always. > I myself just don't like to get 3 copies of mails when I don't have > much interest ;) > > Thanks, > -Kame Hi Kame, Johannes, Sorry for getting into this thread late; here are some comments: There are a few patches that we've been working on which could change the memcg reclaim path quite a bit. If they are merged later, this patch might need to be adjusted accordingly as well, and if the ABI needs to be changed, that would be hard. There is a patch Andrew (abrestic@) has been testing which adds the same memory.vmscan_stat, but based on some page reclaim patches (mainly the memcg-aware global reclaim from Johannes). 
And it does adjust to the hierarchical reclaim change as Johannes mentioned. So, may I suggest we hold off on this patch for now? Once the other page reclaim changes are settled, we can add it back in. Thanks --Ying
* [patch] Revert "memcg: add memory.vmscan_stat" 2011-08-08 23:33 ` KAMEZAWA Hiroyuki @ 2011-08-29 15:51 ` Johannes Weiner -1 siblings, 0 replies; 54+ messages in thread From: Johannes Weiner @ 2011-08-29 15:51 UTC (permalink / raw) To: Andrew Morton Cc: Daisuke Nishimura, Balbir Singh, Andrew Brestic, Ying Han, Michal Hocko, KAMEZAWA Hiroyuki, linux-mm, linux-kernel On Tue, Aug 09, 2011 at 08:33:45AM +0900, KAMEZAWA Hiroyuki wrote: > On Mon, 8 Aug 2011 14:43:33 +0200 > Johannes Weiner <jweiner@redhat.com> wrote: > > > On Fri, Jul 22, 2011 at 05:15:40PM +0900, KAMEZAWA Hiroyuki wrote: > > > +When under_hierarchy is added in the tail, the number indicates the > > > +total memcg scan of its children and itself. > > > > In your implementation, statistics are only accounted to the memcg > > triggering the limit and the respectively scanned memcgs. > > > > Consider the following setup: > > > > A > > / \ > > B C > > / > > D > > > > If D tries to charge but hits the limit of A, then B's hierarchy > > counters do not reflect the reclaim activity resulting in D. > > > yes, as I expected. Andrew, with a flawed design, the author unwilling to fix it, and two NAKs, can we please revert this before the release? This only got in silently because KAMEZAWA-san dropped all parties involved in the discussions around this change from the Cc list of subsequent submissions. --- From: Johannes Weiner <jweiner@redhat.com> Subject: [patch] Revert "memcg: add memory.vmscan_stat" This reverts commit 82f9d486e59f588c7d100865c36510644abda356. The implementation of per-memcg reclaim statistics violates how memcg hierarchies usually behave: hierarchically. The reclaim statistics are accounted to child memcgs and the parent hitting the limit, but not to hierarchy levels in between. Usually, hierarchical statistics are perfectly recursive, with each level representing the sum of itself and all its children. 
Since this exports statistics to userspace, this may lead to confusion and problems with changing things after the release, so revert it now, we can try again later. Conflicts: mm/vmscan.c Signed-off-by: Johannes Weiner <jweiner@redhat.com> --- Documentation/cgroups/memory.txt | 85 +------------------ include/linux/memcontrol.h | 19 ---- include/linux/swap.h | 6 ++ mm/memcontrol.c | 172 ++------------------------------------ mm/vmscan.c | 39 +-------- 5 files changed, 18 insertions(+), 303 deletions(-) diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt index 6f3c598..06eb6d9 100644 --- a/Documentation/cgroups/memory.txt +++ b/Documentation/cgroups/memory.txt @@ -380,7 +380,7 @@ will be charged as a new owner of it. 5.2 stat file -5.2.1 memory.stat file includes following statistics +memory.stat file includes following statistics # per-memory cgroup local status cache - # of bytes of page cache memory. @@ -438,89 +438,6 @@ Note: file_mapped is accounted only when the memory cgroup is owner of page cache.) -5.2.2 memory.vmscan_stat - -memory.vmscan_stat includes statistics information for memory scanning and -freeing, reclaiming. The statistics shows memory scanning information since -memory cgroup creation and can be reset to 0 by writing 0 as - - #echo 0 > ../memory.vmscan_stat - -This file contains following statistics. - -[param]_[file_or_anon]_pages_by_[reason]_[under_heararchy] -[param]_elapsed_ns_by_[reason]_[under_hierarchy] - -For example, - - scanned_file_pages_by_limit indicates the number of scanned - file pages at vmscan. - -Now, 3 parameters are supported - - scanned - the number of pages scanned by vmscan - rotated - the number of pages activated at vmscan - freed - the number of pages freed by vmscan - -If "rotated" is high against scanned/freed, the memcg seems busy. 
- -Now, 2 reason are supported - - limit - the memory cgroup's limit - system - global memory pressure + softlimit - (global memory pressure not under softlimit is not handled now) - -When under_hierarchy is added in the tail, the number indicates the -total memcg scan of its children and itself. - -elapsed_ns is a elapsed time in nanosecond. This may include sleep time -and not indicates CPU usage. So, please take this as just showing -latency. - -Here is an example. - -# cat /cgroup/memory/A/memory.vmscan_stat -scanned_pages_by_limit 9471864 -scanned_anon_pages_by_limit 6640629 -scanned_file_pages_by_limit 2831235 -rotated_pages_by_limit 4243974 -rotated_anon_pages_by_limit 3971968 -rotated_file_pages_by_limit 272006 -freed_pages_by_limit 2318492 -freed_anon_pages_by_limit 962052 -freed_file_pages_by_limit 1356440 -elapsed_ns_by_limit 351386416101 -scanned_pages_by_system 0 -scanned_anon_pages_by_system 0 -scanned_file_pages_by_system 0 -rotated_pages_by_system 0 -rotated_anon_pages_by_system 0 -rotated_file_pages_by_system 0 -freed_pages_by_system 0 -freed_anon_pages_by_system 0 -freed_file_pages_by_system 0 -elapsed_ns_by_system 0 -scanned_pages_by_limit_under_hierarchy 9471864 -scanned_anon_pages_by_limit_under_hierarchy 6640629 -scanned_file_pages_by_limit_under_hierarchy 2831235 -rotated_pages_by_limit_under_hierarchy 4243974 -rotated_anon_pages_by_limit_under_hierarchy 3971968 -rotated_file_pages_by_limit_under_hierarchy 272006 -freed_pages_by_limit_under_hierarchy 2318492 -freed_anon_pages_by_limit_under_hierarchy 962052 -freed_file_pages_by_limit_under_hierarchy 1356440 -elapsed_ns_by_limit_under_hierarchy 351386416101 -scanned_pages_by_system_under_hierarchy 0 -scanned_anon_pages_by_system_under_hierarchy 0 -scanned_file_pages_by_system_under_hierarchy 0 -rotated_pages_by_system_under_hierarchy 0 -rotated_anon_pages_by_system_under_hierarchy 0 -rotated_file_pages_by_system_under_hierarchy 0 -freed_pages_by_system_under_hierarchy 0 
-freed_anon_pages_by_system_under_hierarchy 0 -freed_file_pages_by_system_under_hierarchy 0 -elapsed_ns_by_system_under_hierarchy 0 - 5.3 swappiness Similar to /proc/sys/vm/swappiness, but affecting a hierarchy of groups only. diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 3b535db..343bd76 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -39,16 +39,6 @@ extern unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan, struct mem_cgroup *mem_cont, int active, int file); -struct memcg_scanrecord { - struct mem_cgroup *mem; /* scanend memory cgroup */ - struct mem_cgroup *root; /* scan target hierarchy root */ - int context; /* scanning context (see memcontrol.c) */ - unsigned long nr_scanned[2]; /* the number of scanned pages */ - unsigned long nr_rotated[2]; /* the number of rotated pages */ - unsigned long nr_freed[2]; /* the number of freed pages */ - unsigned long elapsed; /* nsec of time elapsed while scanning */ -}; - #ifdef CONFIG_CGROUP_MEM_RES_CTLR /* * All "charge" functions with gfp_mask should use GFP_KERNEL or @@ -127,15 +117,6 @@ mem_cgroup_get_reclaim_stat_from_page(struct page *page); extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p); -extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem, - gfp_t gfp_mask, bool noswap, - struct memcg_scanrecord *rec); -extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem, - gfp_t gfp_mask, bool noswap, - struct zone *zone, - struct memcg_scanrecord *rec, - unsigned long *nr_scanned); - #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP extern int do_swap_account; #endif diff --git a/include/linux/swap.h b/include/linux/swap.h index 14d6249..c71f84b 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -252,6 +252,12 @@ static inline void lru_cache_add_file(struct page *page) extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order, gfp_t gfp_mask, nodemask_t *mask); 
extern int __isolate_lru_page(struct page *page, int mode, int file); +extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem, + gfp_t gfp_mask, bool noswap); +extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem, + gfp_t gfp_mask, bool noswap, + struct zone *zone, + unsigned long *nr_scanned); extern unsigned long shrink_all_memory(unsigned long nr_pages); extern int vm_swappiness; extern int remove_mapping(struct address_space *mapping, struct page *page); diff --git a/mm/memcontrol.c b/mm/memcontrol.c index ebd1e86..3508777 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -204,50 +204,6 @@ struct mem_cgroup_eventfd_list { static void mem_cgroup_threshold(struct mem_cgroup *mem); static void mem_cgroup_oom_notify(struct mem_cgroup *mem); -enum { - SCAN_BY_LIMIT, - SCAN_BY_SYSTEM, - NR_SCAN_CONTEXT, - SCAN_BY_SHRINK, /* not recorded now */ -}; - -enum { - SCAN, - SCAN_ANON, - SCAN_FILE, - ROTATE, - ROTATE_ANON, - ROTATE_FILE, - FREED, - FREED_ANON, - FREED_FILE, - ELAPSED, - NR_SCANSTATS, -}; - -struct scanstat { - spinlock_t lock; - unsigned long stats[NR_SCAN_CONTEXT][NR_SCANSTATS]; - unsigned long rootstats[NR_SCAN_CONTEXT][NR_SCANSTATS]; -}; - -const char *scanstat_string[NR_SCANSTATS] = { - "scanned_pages", - "scanned_anon_pages", - "scanned_file_pages", - "rotated_pages", - "rotated_anon_pages", - "rotated_file_pages", - "freed_pages", - "freed_anon_pages", - "freed_file_pages", - "elapsed_ns", -}; -#define SCANSTAT_WORD_LIMIT "_by_limit" -#define SCANSTAT_WORD_SYSTEM "_by_system" -#define SCANSTAT_WORD_HIERARCHY "_under_hierarchy" - - /* * The memory controller data structure. The memory controller controls both * page cache and RSS per cgroup. 
We would eventually like to provide @@ -313,8 +269,7 @@ struct mem_cgroup { /* For oom notifier event fd */ struct list_head oom_notify; - /* For recording LRU-scan statistics */ - struct scanstat scanstat; + /* * Should we move charges of a task when a task is moved into this * mem_cgroup ? And what type of charges should we move ? @@ -1678,44 +1633,6 @@ bool mem_cgroup_reclaimable(struct mem_cgroup *mem, bool noswap) } #endif -static void __mem_cgroup_record_scanstat(unsigned long *stats, - struct memcg_scanrecord *rec) -{ - - stats[SCAN] += rec->nr_scanned[0] + rec->nr_scanned[1]; - stats[SCAN_ANON] += rec->nr_scanned[0]; - stats[SCAN_FILE] += rec->nr_scanned[1]; - - stats[ROTATE] += rec->nr_rotated[0] + rec->nr_rotated[1]; - stats[ROTATE_ANON] += rec->nr_rotated[0]; - stats[ROTATE_FILE] += rec->nr_rotated[1]; - - stats[FREED] += rec->nr_freed[0] + rec->nr_freed[1]; - stats[FREED_ANON] += rec->nr_freed[0]; - stats[FREED_FILE] += rec->nr_freed[1]; - - stats[ELAPSED] += rec->elapsed; -} - -static void mem_cgroup_record_scanstat(struct memcg_scanrecord *rec) -{ - struct mem_cgroup *mem; - int context = rec->context; - - if (context >= NR_SCAN_CONTEXT) - return; - - mem = rec->mem; - spin_lock(&mem->scanstat.lock); - __mem_cgroup_record_scanstat(mem->scanstat.stats[context], rec); - spin_unlock(&mem->scanstat.lock); - - mem = rec->root; - spin_lock(&mem->scanstat.lock); - __mem_cgroup_record_scanstat(mem->scanstat.rootstats[context], rec); - spin_unlock(&mem->scanstat.lock); -} - /* * Scan the hierarchy if needed to reclaim memory. 
We remember the last child * we reclaimed from, so that we don't end up penalizing one child extensively @@ -1740,9 +1657,8 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem, bool noswap = reclaim_options & MEM_CGROUP_RECLAIM_NOSWAP; bool shrink = reclaim_options & MEM_CGROUP_RECLAIM_SHRINK; bool check_soft = reclaim_options & MEM_CGROUP_RECLAIM_SOFT; - struct memcg_scanrecord rec; unsigned long excess; - unsigned long scanned; + unsigned long nr_scanned; excess = res_counter_soft_limit_excess(&root_mem->res) >> PAGE_SHIFT; @@ -1750,15 +1666,6 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem, if (!check_soft && !shrink && root_mem->memsw_is_minimum) noswap = true; - if (shrink) - rec.context = SCAN_BY_SHRINK; - else if (check_soft) - rec.context = SCAN_BY_SYSTEM; - else - rec.context = SCAN_BY_LIMIT; - - rec.root = root_mem; - while (1) { victim = mem_cgroup_select_victim(root_mem); if (victim == root_mem) { @@ -1799,23 +1706,14 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem, css_put(&victim->css); continue; } - rec.mem = victim; - rec.nr_scanned[0] = 0; - rec.nr_scanned[1] = 0; - rec.nr_rotated[0] = 0; - rec.nr_rotated[1] = 0; - rec.nr_freed[0] = 0; - rec.nr_freed[1] = 0; - rec.elapsed = 0; /* we use swappiness of local cgroup */ if (check_soft) { ret = mem_cgroup_shrink_node_zone(victim, gfp_mask, - noswap, zone, &rec, &scanned); - *total_scanned += scanned; + noswap, zone, &nr_scanned); + *total_scanned += nr_scanned; } else ret = try_to_free_mem_cgroup_pages(victim, gfp_mask, - noswap, &rec); - mem_cgroup_record_scanstat(&rec); + noswap); css_put(&victim->css); /* * At shrinking usage, we can't check we should stop here or @@ -3854,18 +3752,14 @@ try_to_free: /* try to free all pages in this cgroup */ shrink = 1; while (nr_retries && mem->res.usage > 0) { - struct memcg_scanrecord rec; int progress; if (signal_pending(current)) { ret = -EINTR; goto out; } - rec.context = SCAN_BY_SHRINK; 
- rec.mem = mem; - rec.root = mem; progress = try_to_free_mem_cgroup_pages(mem, GFP_KERNEL, - false, &rec); + false); if (!progress) { nr_retries--; /* maybe some writeback is necessary */ @@ -4709,54 +4603,6 @@ static int mem_control_numa_stat_open(struct inode *unused, struct file *file) } #endif /* CONFIG_NUMA */ -static int mem_cgroup_vmscan_stat_read(struct cgroup *cgrp, - struct cftype *cft, - struct cgroup_map_cb *cb) -{ - struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp); - char string[64]; - int i; - - for (i = 0; i < NR_SCANSTATS; i++) { - strcpy(string, scanstat_string[i]); - strcat(string, SCANSTAT_WORD_LIMIT); - cb->fill(cb, string, mem->scanstat.stats[SCAN_BY_LIMIT][i]); - } - - for (i = 0; i < NR_SCANSTATS; i++) { - strcpy(string, scanstat_string[i]); - strcat(string, SCANSTAT_WORD_SYSTEM); - cb->fill(cb, string, mem->scanstat.stats[SCAN_BY_SYSTEM][i]); - } - - for (i = 0; i < NR_SCANSTATS; i++) { - strcpy(string, scanstat_string[i]); - strcat(string, SCANSTAT_WORD_LIMIT); - strcat(string, SCANSTAT_WORD_HIERARCHY); - cb->fill(cb, string, mem->scanstat.rootstats[SCAN_BY_LIMIT][i]); - } - for (i = 0; i < NR_SCANSTATS; i++) { - strcpy(string, scanstat_string[i]); - strcat(string, SCANSTAT_WORD_SYSTEM); - strcat(string, SCANSTAT_WORD_HIERARCHY); - cb->fill(cb, string, mem->scanstat.rootstats[SCAN_BY_SYSTEM][i]); - } - return 0; -} - -static int mem_cgroup_reset_vmscan_stat(struct cgroup *cgrp, - unsigned int event) -{ - struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp); - - spin_lock(&mem->scanstat.lock); - memset(&mem->scanstat.stats, 0, sizeof(mem->scanstat.stats)); - memset(&mem->scanstat.rootstats, 0, sizeof(mem->scanstat.rootstats)); - spin_unlock(&mem->scanstat.lock); - return 0; -} - - static struct cftype mem_cgroup_files[] = { { .name = "usage_in_bytes", @@ -4827,11 +4673,6 @@ static struct cftype mem_cgroup_files[] = { .mode = S_IRUGO, }, #endif - { - .name = "vmscan_stat", - .read_map = mem_cgroup_vmscan_stat_read, - .trigger = 
* [patch] Revert "memcg: add memory.vmscan_stat" @ 2011-08-29 15:51 ` Johannes Weiner

From: Johannes Weiner @ 2011-08-29 15:51 UTC
To: Andrew Morton
Cc: Daisuke Nishimura, Balbir Singh, Andrew Brestic, Ying Han, Michal Hocko, KAMEZAWA Hiroyuki, linux-mm, linux-kernel

On Tue, Aug 09, 2011 at 08:33:45AM +0900, KAMEZAWA Hiroyuki wrote:
> On Mon, 8 Aug 2011 14:43:33 +0200
> Johannes Weiner <jweiner@redhat.com> wrote:
>
> > On Fri, Jul 22, 2011 at 05:15:40PM +0900, KAMEZAWA Hiroyuki wrote:
> > > +When under_hierarchy is added in the tail, the number indicates the
> > > +total memcg scan of its children and itself.
> >
> > In your implementation, statistics are only accounted to the memcg
> > triggering the limit and the respectively scanned memcgs.
> >
> > Consider the following setup:
> >
> >         A
> >        / \
> >       B   C
> >      /
> >     D
> >
> > If D tries to charge but hits the limit of A, then B's hierarchy
> > counters do not reflect the reclaim activity resulting in D.
> >
> yes, as I expected.

Andrew,

with a flawed design, the author unwilling to fix it, and two NAKs, can we please revert this before the release? This only got in silently because KAMEZAWA-san dropped all parties involved in the discussions around this change from the Cc list of subsequent submissions.

---
From: Johannes Weiner <jweiner@redhat.com>
Subject: [patch] Revert "memcg: add memory.vmscan_stat"

This reverts commit 82f9d486e59f588c7d100865c36510644abda356.

The implementation of per-memcg reclaim statistics violates how memcg hierarchies usually behave: hierarchically. The reclaim statistics are accounted to child memcgs and to the parent hitting the limit, but not to the hierarchy levels in between. Usually, hierarchical statistics are perfectly recursive, with each level representing the sum of itself and all its children.
Since this exports statistics to userspace, this may lead to confusion and problems with changing things after the release, so revert it now, we can try again later. Conflicts: mm/vmscan.c Signed-off-by: Johannes Weiner <jweiner@redhat.com> --- Documentation/cgroups/memory.txt | 85 +------------------ include/linux/memcontrol.h | 19 ---- include/linux/swap.h | 6 ++ mm/memcontrol.c | 172 ++------------------------------------ mm/vmscan.c | 39 +-------- 5 files changed, 18 insertions(+), 303 deletions(-) diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt index 6f3c598..06eb6d9 100644 --- a/Documentation/cgroups/memory.txt +++ b/Documentation/cgroups/memory.txt @@ -380,7 +380,7 @@ will be charged as a new owner of it. 5.2 stat file -5.2.1 memory.stat file includes following statistics +memory.stat file includes following statistics # per-memory cgroup local status cache - # of bytes of page cache memory. @@ -438,89 +438,6 @@ Note: file_mapped is accounted only when the memory cgroup is owner of page cache.) -5.2.2 memory.vmscan_stat - -memory.vmscan_stat includes statistics information for memory scanning and -freeing, reclaiming. The statistics shows memory scanning information since -memory cgroup creation and can be reset to 0 by writing 0 as - - #echo 0 > ../memory.vmscan_stat - -This file contains following statistics. - -[param]_[file_or_anon]_pages_by_[reason]_[under_heararchy] -[param]_elapsed_ns_by_[reason]_[under_hierarchy] - -For example, - - scanned_file_pages_by_limit indicates the number of scanned - file pages at vmscan. - -Now, 3 parameters are supported - - scanned - the number of pages scanned by vmscan - rotated - the number of pages activated at vmscan - freed - the number of pages freed by vmscan - -If "rotated" is high against scanned/freed, the memcg seems busy. 
- -Now, 2 reason are supported - - limit - the memory cgroup's limit - system - global memory pressure + softlimit - (global memory pressure not under softlimit is not handled now) - -When under_hierarchy is added in the tail, the number indicates the -total memcg scan of its children and itself. - -elapsed_ns is a elapsed time in nanosecond. This may include sleep time -and not indicates CPU usage. So, please take this as just showing -latency. - -Here is an example. - -# cat /cgroup/memory/A/memory.vmscan_stat -scanned_pages_by_limit 9471864 -scanned_anon_pages_by_limit 6640629 -scanned_file_pages_by_limit 2831235 -rotated_pages_by_limit 4243974 -rotated_anon_pages_by_limit 3971968 -rotated_file_pages_by_limit 272006 -freed_pages_by_limit 2318492 -freed_anon_pages_by_limit 962052 -freed_file_pages_by_limit 1356440 -elapsed_ns_by_limit 351386416101 -scanned_pages_by_system 0 -scanned_anon_pages_by_system 0 -scanned_file_pages_by_system 0 -rotated_pages_by_system 0 -rotated_anon_pages_by_system 0 -rotated_file_pages_by_system 0 -freed_pages_by_system 0 -freed_anon_pages_by_system 0 -freed_file_pages_by_system 0 -elapsed_ns_by_system 0 -scanned_pages_by_limit_under_hierarchy 9471864 -scanned_anon_pages_by_limit_under_hierarchy 6640629 -scanned_file_pages_by_limit_under_hierarchy 2831235 -rotated_pages_by_limit_under_hierarchy 4243974 -rotated_anon_pages_by_limit_under_hierarchy 3971968 -rotated_file_pages_by_limit_under_hierarchy 272006 -freed_pages_by_limit_under_hierarchy 2318492 -freed_anon_pages_by_limit_under_hierarchy 962052 -freed_file_pages_by_limit_under_hierarchy 1356440 -elapsed_ns_by_limit_under_hierarchy 351386416101 -scanned_pages_by_system_under_hierarchy 0 -scanned_anon_pages_by_system_under_hierarchy 0 -scanned_file_pages_by_system_under_hierarchy 0 -rotated_pages_by_system_under_hierarchy 0 -rotated_anon_pages_by_system_under_hierarchy 0 -rotated_file_pages_by_system_under_hierarchy 0 -freed_pages_by_system_under_hierarchy 0 
-freed_anon_pages_by_system_under_hierarchy 0 -freed_file_pages_by_system_under_hierarchy 0 -elapsed_ns_by_system_under_hierarchy 0 - 5.3 swappiness Similar to /proc/sys/vm/swappiness, but affecting a hierarchy of groups only. diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 3b535db..343bd76 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -39,16 +39,6 @@ extern unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan, struct mem_cgroup *mem_cont, int active, int file); -struct memcg_scanrecord { - struct mem_cgroup *mem; /* scanend memory cgroup */ - struct mem_cgroup *root; /* scan target hierarchy root */ - int context; /* scanning context (see memcontrol.c) */ - unsigned long nr_scanned[2]; /* the number of scanned pages */ - unsigned long nr_rotated[2]; /* the number of rotated pages */ - unsigned long nr_freed[2]; /* the number of freed pages */ - unsigned long elapsed; /* nsec of time elapsed while scanning */ -}; - #ifdef CONFIG_CGROUP_MEM_RES_CTLR /* * All "charge" functions with gfp_mask should use GFP_KERNEL or @@ -127,15 +117,6 @@ mem_cgroup_get_reclaim_stat_from_page(struct page *page); extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p); -extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem, - gfp_t gfp_mask, bool noswap, - struct memcg_scanrecord *rec); -extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem, - gfp_t gfp_mask, bool noswap, - struct zone *zone, - struct memcg_scanrecord *rec, - unsigned long *nr_scanned); - #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP extern int do_swap_account; #endif diff --git a/include/linux/swap.h b/include/linux/swap.h index 14d6249..c71f84b 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -252,6 +252,12 @@ static inline void lru_cache_add_file(struct page *page) extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order, gfp_t gfp_mask, nodemask_t *mask); 
extern int __isolate_lru_page(struct page *page, int mode, int file); +extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem, + gfp_t gfp_mask, bool noswap); +extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem, + gfp_t gfp_mask, bool noswap, + struct zone *zone, + unsigned long *nr_scanned); extern unsigned long shrink_all_memory(unsigned long nr_pages); extern int vm_swappiness; extern int remove_mapping(struct address_space *mapping, struct page *page); diff --git a/mm/memcontrol.c b/mm/memcontrol.c index ebd1e86..3508777 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -204,50 +204,6 @@ struct mem_cgroup_eventfd_list { static void mem_cgroup_threshold(struct mem_cgroup *mem); static void mem_cgroup_oom_notify(struct mem_cgroup *mem); -enum { - SCAN_BY_LIMIT, - SCAN_BY_SYSTEM, - NR_SCAN_CONTEXT, - SCAN_BY_SHRINK, /* not recorded now */ -}; - -enum { - SCAN, - SCAN_ANON, - SCAN_FILE, - ROTATE, - ROTATE_ANON, - ROTATE_FILE, - FREED, - FREED_ANON, - FREED_FILE, - ELAPSED, - NR_SCANSTATS, -}; - -struct scanstat { - spinlock_t lock; - unsigned long stats[NR_SCAN_CONTEXT][NR_SCANSTATS]; - unsigned long rootstats[NR_SCAN_CONTEXT][NR_SCANSTATS]; -}; - -const char *scanstat_string[NR_SCANSTATS] = { - "scanned_pages", - "scanned_anon_pages", - "scanned_file_pages", - "rotated_pages", - "rotated_anon_pages", - "rotated_file_pages", - "freed_pages", - "freed_anon_pages", - "freed_file_pages", - "elapsed_ns", -}; -#define SCANSTAT_WORD_LIMIT "_by_limit" -#define SCANSTAT_WORD_SYSTEM "_by_system" -#define SCANSTAT_WORD_HIERARCHY "_under_hierarchy" - - /* * The memory controller data structure. The memory controller controls both * page cache and RSS per cgroup. 
We would eventually like to provide @@ -313,8 +269,7 @@ struct mem_cgroup { /* For oom notifier event fd */ struct list_head oom_notify; - /* For recording LRU-scan statistics */ - struct scanstat scanstat; + /* * Should we move charges of a task when a task is moved into this * mem_cgroup ? And what type of charges should we move ? @@ -1678,44 +1633,6 @@ bool mem_cgroup_reclaimable(struct mem_cgroup *mem, bool noswap) } #endif -static void __mem_cgroup_record_scanstat(unsigned long *stats, - struct memcg_scanrecord *rec) -{ - - stats[SCAN] += rec->nr_scanned[0] + rec->nr_scanned[1]; - stats[SCAN_ANON] += rec->nr_scanned[0]; - stats[SCAN_FILE] += rec->nr_scanned[1]; - - stats[ROTATE] += rec->nr_rotated[0] + rec->nr_rotated[1]; - stats[ROTATE_ANON] += rec->nr_rotated[0]; - stats[ROTATE_FILE] += rec->nr_rotated[1]; - - stats[FREED] += rec->nr_freed[0] + rec->nr_freed[1]; - stats[FREED_ANON] += rec->nr_freed[0]; - stats[FREED_FILE] += rec->nr_freed[1]; - - stats[ELAPSED] += rec->elapsed; -} - -static void mem_cgroup_record_scanstat(struct memcg_scanrecord *rec) -{ - struct mem_cgroup *mem; - int context = rec->context; - - if (context >= NR_SCAN_CONTEXT) - return; - - mem = rec->mem; - spin_lock(&mem->scanstat.lock); - __mem_cgroup_record_scanstat(mem->scanstat.stats[context], rec); - spin_unlock(&mem->scanstat.lock); - - mem = rec->root; - spin_lock(&mem->scanstat.lock); - __mem_cgroup_record_scanstat(mem->scanstat.rootstats[context], rec); - spin_unlock(&mem->scanstat.lock); -} - /* * Scan the hierarchy if needed to reclaim memory. 
We remember the last child * we reclaimed from, so that we don't end up penalizing one child extensively @@ -1740,9 +1657,8 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem, bool noswap = reclaim_options & MEM_CGROUP_RECLAIM_NOSWAP; bool shrink = reclaim_options & MEM_CGROUP_RECLAIM_SHRINK; bool check_soft = reclaim_options & MEM_CGROUP_RECLAIM_SOFT; - struct memcg_scanrecord rec; unsigned long excess; - unsigned long scanned; + unsigned long nr_scanned; excess = res_counter_soft_limit_excess(&root_mem->res) >> PAGE_SHIFT; @@ -1750,15 +1666,6 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem, if (!check_soft && !shrink && root_mem->memsw_is_minimum) noswap = true; - if (shrink) - rec.context = SCAN_BY_SHRINK; - else if (check_soft) - rec.context = SCAN_BY_SYSTEM; - else - rec.context = SCAN_BY_LIMIT; - - rec.root = root_mem; - while (1) { victim = mem_cgroup_select_victim(root_mem); if (victim == root_mem) { @@ -1799,23 +1706,14 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem, css_put(&victim->css); continue; } - rec.mem = victim; - rec.nr_scanned[0] = 0; - rec.nr_scanned[1] = 0; - rec.nr_rotated[0] = 0; - rec.nr_rotated[1] = 0; - rec.nr_freed[0] = 0; - rec.nr_freed[1] = 0; - rec.elapsed = 0; /* we use swappiness of local cgroup */ if (check_soft) { ret = mem_cgroup_shrink_node_zone(victim, gfp_mask, - noswap, zone, &rec, &scanned); - *total_scanned += scanned; + noswap, zone, &nr_scanned); + *total_scanned += nr_scanned; } else ret = try_to_free_mem_cgroup_pages(victim, gfp_mask, - noswap, &rec); - mem_cgroup_record_scanstat(&rec); + noswap); css_put(&victim->css); /* * At shrinking usage, we can't check we should stop here or @@ -3854,18 +3752,14 @@ try_to_free: /* try to free all pages in this cgroup */ shrink = 1; while (nr_retries && mem->res.usage > 0) { - struct memcg_scanrecord rec; int progress; if (signal_pending(current)) { ret = -EINTR; goto out; } - rec.context = SCAN_BY_SHRINK; 
- rec.mem = mem; - rec.root = mem; progress = try_to_free_mem_cgroup_pages(mem, GFP_KERNEL, - false, &rec); + false); if (!progress) { nr_retries--; /* maybe some writeback is necessary */ @@ -4709,54 +4603,6 @@ static int mem_control_numa_stat_open(struct inode *unused, struct file *file) } #endif /* CONFIG_NUMA */ -static int mem_cgroup_vmscan_stat_read(struct cgroup *cgrp, - struct cftype *cft, - struct cgroup_map_cb *cb) -{ - struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp); - char string[64]; - int i; - - for (i = 0; i < NR_SCANSTATS; i++) { - strcpy(string, scanstat_string[i]); - strcat(string, SCANSTAT_WORD_LIMIT); - cb->fill(cb, string, mem->scanstat.stats[SCAN_BY_LIMIT][i]); - } - - for (i = 0; i < NR_SCANSTATS; i++) { - strcpy(string, scanstat_string[i]); - strcat(string, SCANSTAT_WORD_SYSTEM); - cb->fill(cb, string, mem->scanstat.stats[SCAN_BY_SYSTEM][i]); - } - - for (i = 0; i < NR_SCANSTATS; i++) { - strcpy(string, scanstat_string[i]); - strcat(string, SCANSTAT_WORD_LIMIT); - strcat(string, SCANSTAT_WORD_HIERARCHY); - cb->fill(cb, string, mem->scanstat.rootstats[SCAN_BY_LIMIT][i]); - } - for (i = 0; i < NR_SCANSTATS; i++) { - strcpy(string, scanstat_string[i]); - strcat(string, SCANSTAT_WORD_SYSTEM); - strcat(string, SCANSTAT_WORD_HIERARCHY); - cb->fill(cb, string, mem->scanstat.rootstats[SCAN_BY_SYSTEM][i]); - } - return 0; -} - -static int mem_cgroup_reset_vmscan_stat(struct cgroup *cgrp, - unsigned int event) -{ - struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp); - - spin_lock(&mem->scanstat.lock); - memset(&mem->scanstat.stats, 0, sizeof(mem->scanstat.stats)); - memset(&mem->scanstat.rootstats, 0, sizeof(mem->scanstat.rootstats)); - spin_unlock(&mem->scanstat.lock); - return 0; -} - - static struct cftype mem_cgroup_files[] = { { .name = "usage_in_bytes", @@ -4827,11 +4673,6 @@ static struct cftype mem_cgroup_files[] = { .mode = S_IRUGO, }, #endif - { - .name = "vmscan_stat", - .read_map = mem_cgroup_vmscan_stat_read, - .trigger = 
mem_cgroup_reset_vmscan_stat, - }, }; #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP @@ -5095,7 +4936,6 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont) atomic_set(&mem->refcnt, 1); mem->move_charge_at_immigrate = 0; mutex_init(&mem->thresholds_lock); - spin_lock_init(&mem->scanstat.lock); return &mem->css; free_out: __mem_cgroup_free(mem); diff --git a/mm/vmscan.c b/mm/vmscan.c index 04bb6ae..6588746 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -105,7 +105,6 @@ struct scan_control { /* Which cgroup do we reclaim from */ struct mem_cgroup *mem_cgroup; - struct memcg_scanrecord *memcg_record; /* * Nodemask of nodes allowed by the caller. If NULL, all nodes @@ -1349,8 +1348,6 @@ putback_lru_pages(struct zone *zone, struct scan_control *sc, int file = is_file_lru(lru); int numpages = hpage_nr_pages(page); reclaim_stat->recent_rotated[file] += numpages; - if (!scanning_global_lru(sc)) - sc->memcg_record->nr_rotated[file] += numpages; } if (!pagevec_add(&pvec, page)) { spin_unlock_irq(&zone->lru_lock); @@ -1394,10 +1391,6 @@ static noinline_for_stack void update_isolated_counts(struct zone *zone, reclaim_stat->recent_scanned[0] += *nr_anon; reclaim_stat->recent_scanned[1] += *nr_file; - if (!scanning_global_lru(sc)) { - sc->memcg_record->nr_scanned[0] += *nr_anon; - sc->memcg_record->nr_scanned[1] += *nr_file; - } } /* @@ -1511,9 +1504,6 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone, nr_reclaimed += shrink_page_list(&page_list, zone, sc); } - if (!scanning_global_lru(sc)) - sc->memcg_record->nr_freed[file] += nr_reclaimed; - local_irq_disable(); if (current_is_kswapd()) __count_vm_events(KSWAPD_STEAL, nr_reclaimed); @@ -1613,8 +1603,6 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone, } reclaim_stat->recent_scanned[file] += nr_taken; - if (!scanning_global_lru(sc)) - sc->memcg_record->nr_scanned[file] += nr_taken; __count_zone_vm_events(PGREFILL, zone, pgscanned); if (file) @@ -1666,8 +1654,6 @@ static void 
shrink_active_list(unsigned long nr_pages, struct zone *zone, * get_scan_ratio. */ reclaim_stat->recent_rotated[file] += nr_rotated; - if (!scanning_global_lru(sc)) - sc->memcg_record->nr_rotated[file] += nr_rotated; move_active_pages_to_lru(zone, &l_active, LRU_ACTIVE + file * LRU_FILE); @@ -2253,10 +2239,9 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order, #ifdef CONFIG_CGROUP_MEM_RES_CTLR unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem, - gfp_t gfp_mask, bool noswap, - struct zone *zone, - struct memcg_scanrecord *rec, - unsigned long *scanned) + gfp_t gfp_mask, bool noswap, + struct zone *zone, + unsigned long *nr_scanned) { struct scan_control sc = { .nr_scanned = 0, @@ -2266,9 +2251,7 @@ unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem, .may_swap = !noswap, .order = 0, .mem_cgroup = mem, - .memcg_record = rec, }; - ktime_t start, end; sc.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) | (GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK); @@ -2277,7 +2260,6 @@ unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem, sc.may_writepage, sc.gfp_mask); - start = ktime_get(); /* * NOTE: Although we can get the priority field, using it * here is not a good idea, since it limits the pages we can scan. @@ -2286,25 +2268,19 @@ unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem, * the priority and make it zero. 
*/ shrink_zone(0, zone, &sc); - end = ktime_get(); - - if (rec) - rec->elapsed += ktime_to_ns(ktime_sub(end, start)); - *scanned = sc.nr_scanned; trace_mm_vmscan_memcg_softlimit_reclaim_end(sc.nr_reclaimed); + *nr_scanned = sc.nr_scanned; return sc.nr_reclaimed; } unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont, gfp_t gfp_mask, - bool noswap, - struct memcg_scanrecord *rec) + bool noswap) { struct zonelist *zonelist; unsigned long nr_reclaimed; - ktime_t start, end; int nid; struct scan_control sc = { .may_writepage = !laptop_mode, @@ -2313,7 +2289,6 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont, .nr_to_reclaim = SWAP_CLUSTER_MAX, .order = 0, .mem_cgroup = mem_cont, - .memcg_record = rec, .nodemask = NULL, /* we don't care the placement */ .gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) | (GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK), @@ -2322,7 +2297,6 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont, .gfp_mask = sc.gfp_mask, }; - start = ktime_get(); /* * Unlike direct reclaim via alloc_pages(), memcg's reclaim doesn't * take care of from where we get pages. So the node where we start the @@ -2337,9 +2311,6 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont, sc.gfp_mask); nr_reclaimed = do_try_to_free_pages(zonelist, &sc, &shrink); - end = ktime_get(); - if (rec) - rec->elapsed += ktime_to_ns(ktime_sub(end, start)); trace_mm_vmscan_memcg_reclaim_end(nr_reclaimed); -- 1.7.6 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 54+ messages in thread
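The revert's core argument is that hierarchical statistics should be perfectly recursive: each level reports its own activity plus that of all its descendants. A minimal userspace sketch of that invariant (hypothetical struct and function names, not kernel code):

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of recursive hierarchy accounting: every memcg keeps a
 * local counter and a hierarchical counter, and an event is charged
 * to the hierarchical counter of every ancestor, so each level ends
 * up as the sum of itself and all of its children. */
struct memcg_model {
	struct memcg_model *parent;
	unsigned long scanned;           /* this memcg only */
	unsigned long hierarchy_scanned; /* self + all descendants */
};

static void account_scan(struct memcg_model *memcg, unsigned long nr)
{
	struct memcg_model *iter;

	memcg->scanned += nr;
	/* walk the parent chain so every level stays recursive */
	for (iter = memcg; iter; iter = iter->parent)
		iter->hierarchy_scanned += nr;
}
```

With a chain A <- B <- D, scanning in D then shows up in B's and A's hierarchical counters as well — exactly the property the reverted vmscan_stat implementation lacked for the intermediate level B.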
* Re: [patch] Revert "memcg: add memory.vmscan_stat" @ 2011-08-30 1:12 ` KAMEZAWA Hiroyuki

From: KAMEZAWA Hiroyuki @ 2011-08-30 1:12 UTC
To: Johannes Weiner
Cc: Andrew Morton, Daisuke Nishimura, Balbir Singh, Andrew Brestic, Ying Han, Michal Hocko, linux-mm, linux-kernel

On Mon, 29 Aug 2011 17:51:13 +0200
Johannes Weiner <jweiner@redhat.com> wrote:
> On Tue, Aug 09, 2011 at 08:33:45AM +0900, KAMEZAWA Hiroyuki wrote:
> > On Mon, 8 Aug 2011 14:43:33 +0200
> > Johannes Weiner <jweiner@redhat.com> wrote:
> >
> > > On Fri, Jul 22, 2011 at 05:15:40PM +0900, KAMEZAWA Hiroyuki wrote:
> > > > +When under_hierarchy is added in the tail, the number indicates the
> > > > +total memcg scan of its children and itself.
> > >
> > > In your implementation, statistics are only accounted to the memcg
> > > triggering the limit and the respectively scanned memcgs.
> > >
> > > Consider the following setup:
> > >
> > >         A
> > >        / \
> > >       B   C
> > >      /
> > >     D
> > >
> > > If D tries to charge but hits the limit of A, then B's hierarchy
> > > counters do not reflect the reclaim activity resulting in D.
> > >
> > yes, as I expected.
>
> Andrew,
>
> with a flawed design, the author unwilling to fix it, and two NAKs,
> can we please revert this before the release?

How about this ?

==
Currently, vmscan_stat's hierarchy counter only counts scan activity caused by the owner of the limit, so it is not 'hierarchical' in the way the rest of memcg behaves. For example, assuming the following hierarchy

    A
   /
  B
 /
C

when B and C are scanned because of A's limit, vmscan_stat's hierarchy accounting does:

  A's hierarchy scan = A's scan + B's scan + C's scan
  B's hierarchy scan = 0
  C's hierarchy scan = 0

The first design was chosen because the author considered C's scan to be caused by A. But for consistency with the rest of the interface, the following is more natural:
A's hierarchy scan = A'scan + B'scan + C'scan B's hierarchy scan = B'scan + C'scan C's hierarchy scan = C'scan This patch changes counting implementation. Suggested-by: Johannes Weiner <jweiner@redhat.com> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> --- mm/memcontrol.c | 28 ++++++++++++++++++---------- 1 file changed, 18 insertions(+), 10 deletions(-) Index: mmotm-Aug29/mm/memcontrol.c =================================================================== --- mmotm-Aug29.orig/mm/memcontrol.c +++ mmotm-Aug29/mm/memcontrol.c @@ -229,7 +229,7 @@ enum { struct scanstat { spinlock_t lock; unsigned long stats[NR_SCAN_CONTEXT][NR_SCANSTATS]; - unsigned long rootstats[NR_SCAN_CONTEXT][NR_SCANSTATS]; + unsigned long hierarchy_stats[NR_SCAN_CONTEXT][NR_SCANSTATS]; }; const char *scanstat_string[NR_SCANSTATS] = { @@ -1701,6 +1701,7 @@ static void __mem_cgroup_record_scanstat static void mem_cgroup_record_scanstat(struct memcg_scanrecord *rec) { struct mem_cgroup *memcg; + struct cgroup *cgroup; int context = rec->context; if (context >= NR_SCAN_CONTEXT) @@ -1710,11 +1711,18 @@ static void mem_cgroup_record_scanstat(s spin_lock(&memcg->scanstat.lock); __mem_cgroup_record_scanstat(memcg->scanstat.stats[context], rec); spin_unlock(&memcg->scanstat.lock); - - memcg = rec->root; - spin_lock(&memcg->scanstat.lock); - __mem_cgroup_record_scanstat(memcg->scanstat.rootstats[context], rec); - spin_unlock(&memcg->scanstat.lock); + cgroup = memcg->css.cgroup; + do { + spin_lock(&memcg->scanstat.lock); + __mem_cgroup_record_scanstat( + memcg->scanstat.hierarchy_stats[context], rec); + spin_unlock(&memcg->scanstat.lock); + if (!cgroup->parent) + break; + cgroup = cgroup->parent; + memcg = mem_cgroup_from_cont(cgroup); + } while (memcg->use_hierarchy && memcg != rec->root); + return; } /* @@ -4733,14 +4741,14 @@ static int mem_cgroup_vmscan_stat_read(s strcat(string, SCANSTAT_WORD_LIMIT); strcat(string, SCANSTAT_WORD_HIERARCHY); cb->fill(cb, - string, 
memcg->scanstat.rootstats[SCAN_BY_LIMIT][i]); + string, memcg->scanstat.hierarchy_stats[SCAN_BY_LIMIT][i]); } for (i = 0; i < NR_SCANSTATS; i++) { strcpy(string, scanstat_string[i]); strcat(string, SCANSTAT_WORD_SYSTEM); strcat(string, SCANSTAT_WORD_HIERARCHY); cb->fill(cb, - string, memcg->scanstat.rootstats[SCAN_BY_SYSTEM][i]); + string, memcg->scanstat.hierarchy_stats[SCAN_BY_SYSTEM][i]); } return 0; } @@ -4752,8 +4760,8 @@ static int mem_cgroup_reset_vmscan_stat( spin_lock(&memcg->scanstat.lock); memset(&memcg->scanstat.stats, 0, sizeof(memcg->scanstat.stats)); - memset(&memcg->scanstat.rootstats, - 0, sizeof(memcg->scanstat.rootstats)); + memset(&memcg->scanstat.hierarchy_stats, + 0, sizeof(memcg->scanstat.hierarchy_stats)); spin_unlock(&memcg->scanstat.lock); return 0; } ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [patch] Revert "memcg: add memory.vmscan_stat" @ 2011-08-30 7:04 ` Johannes Weiner

From: Johannes Weiner @ 2011-08-30 7:04 UTC
To: KAMEZAWA Hiroyuki
Cc: Andrew Morton, Daisuke Nishimura, Balbir Singh, Andrew Brestic, Ying Han, Michal Hocko, linux-mm, linux-kernel

On Tue, Aug 30, 2011 at 10:12:33AM +0900, KAMEZAWA Hiroyuki wrote:
> On Mon, 29 Aug 2011 17:51:13 +0200
> Johannes Weiner <jweiner@redhat.com> wrote:
>
> > On Tue, Aug 09, 2011 at 08:33:45AM +0900, KAMEZAWA Hiroyuki wrote:
> > > On Mon, 8 Aug 2011 14:43:33 +0200
> > > Johannes Weiner <jweiner@redhat.com> wrote:
> > >
> > > > On Fri, Jul 22, 2011 at 05:15:40PM +0900, KAMEZAWA Hiroyuki wrote:
> > > > > +When under_hierarchy is added in the tail, the number indicates the
> > > > > +total memcg scan of its children and itself.
> > > >
> > > > In your implementation, statistics are only accounted to the memcg
> > > > triggering the limit and the respectively scanned memcgs.
> > > >
> > > > Consider the following setup:
> > > >
> > > >         A
> > > >        / \
> > > >       B   C
> > > >      /
> > > >     D
> > > >
> > > > If D tries to charge but hits the limit of A, then B's hierarchy
> > > > counters do not reflect the reclaim activity resulting in D.
> > > >
> > > yes, as I expected.
> >
> > Andrew,
> >
> > with a flawed design, the author unwilling to fix it, and two NAKs,
> > can we please revert this before the release?
>
> How about this ?
> @@ -1710,11 +1711,18 @@ static void mem_cgroup_record_scanstat(s > spin_lock(&memcg->scanstat.lock); > __mem_cgroup_record_scanstat(memcg->scanstat.stats[context], rec); > spin_unlock(&memcg->scanstat.lock); > - > - memcg = rec->root; > - spin_lock(&memcg->scanstat.lock); > - __mem_cgroup_record_scanstat(memcg->scanstat.rootstats[context], rec); > - spin_unlock(&memcg->scanstat.lock); > + cgroup = memcg->css.cgroup; > + do { > + spin_lock(&memcg->scanstat.lock); > + __mem_cgroup_record_scanstat( > + memcg->scanstat.hierarchy_stats[context], rec); > + spin_unlock(&memcg->scanstat.lock); > + if (!cgroup->parent) > + break; > + cgroup = cgroup->parent; > + memcg = mem_cgroup_from_cont(cgroup); > + } while (memcg->use_hierarchy && memcg != rec->root); Okay, so this looks correct, but it sums up all parents after each memcg scanned, which could have a performance impact. Usually, hierarchy statistics are only summed up when a user reads them. I don't get why this has to be done completely different from the way we usually do things, without any justification, whatsoever. Why do you want to pass a recording structure down the reclaim stack? Why not make it per-cpu counters that are only summed up, together with the hierarchy values, when someone is actually interested in them? With an interface like mem_cgroup_count_vm_event(), or maybe even an extension of that function? ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [patch] Revert "memcg: add memory.vmscan_stat"

From: KAMEZAWA Hiroyuki @ 2011-08-30  7:20 UTC (permalink / raw)
To: Johannes Weiner
Cc: Andrew Morton, Daisuke Nishimura, Balbir Singh, Andrew Brestic, Ying Han, Michal Hocko, linux-mm, linux-kernel

On Tue, 30 Aug 2011 09:04:24 +0200
Johannes Weiner <jweiner@redhat.com> wrote:

> On Tue, Aug 30, 2011 at 10:12:33AM +0900, KAMEZAWA Hiroyuki wrote:
> > @@ -1710,11 +1711,18 @@ static void mem_cgroup_record_scanstat(s
> > 	spin_lock(&memcg->scanstat.lock);
> > 	__mem_cgroup_record_scanstat(memcg->scanstat.stats[context], rec);
> > 	spin_unlock(&memcg->scanstat.lock);
> > -
> > -	memcg = rec->root;
> > -	spin_lock(&memcg->scanstat.lock);
> > -	__mem_cgroup_record_scanstat(memcg->scanstat.rootstats[context], rec);
> > -	spin_unlock(&memcg->scanstat.lock);
> > +	cgroup = memcg->css.cgroup;
> > +	do {
> > +		spin_lock(&memcg->scanstat.lock);
> > +		__mem_cgroup_record_scanstat(
> > +			memcg->scanstat.hierarchy_stats[context], rec);
> > +		spin_unlock(&memcg->scanstat.lock);
> > +		if (!cgroup->parent)
> > +			break;
> > +		cgroup = cgroup->parent;
> > +		memcg = mem_cgroup_from_cont(cgroup);
> > +	} while (memcg->use_hierarchy && memcg != rec->root);
> 
> Okay, so this looks correct, but it sums up all parents after each
> memcg scanned, which could have a performance impact.  Usually,
> hierarchy statistics are only summed up when a user reads them.

Hmm. But sum-at-read doesn't work.

Assume 3 cgroups in a hierarchy.

    A
   /
  B
 /
C

C's scan contains 3 causes:
  C's scan caused by the limit of A,
  C's scan caused by the limit of B,
  C's scan caused by the limit of C.

If we make the hierarchy sum at read, we think

  B's scan_stat = B's scan_stat + C's scan_stat

But in precise terms, this is

  B's scan_stat = B's scan_stat caused by B +
                  B's scan_stat caused by A +
                  C's scan_stat caused by C +
                  C's scan_stat caused by B +
                  C's scan_stat caused by A.

In the original version:

  B's scan_stat = B's scan_stat caused by B +
                  C's scan_stat caused by B

After this patch:

  B's scan_stat = B's scan_stat caused by B +
                  B's scan_stat caused by A +
                  C's scan_stat caused by C +
                  C's scan_stat caused by B +
                  C's scan_stat caused by A.

Hmm...removing hierarchy part completely seems fine to me.

> I don't get why this has to be done completely different from the way
> we usually do things, without any justification, whatsoever.
> 
> Why do you want to pass a recording structure down the reclaim stack?

Just for reducing number of passed variables.

> Why not make it per-cpu counters that are only summed up, together
> with the hierarchy values, when someone is actually interested in
> them?  With an interface like mem_cgroup_count_vm_event(), or maybe
> even an extension of that function?

A percpu counter seems overkill to me because there is no heavy lock
contention.

Thanks,
-Kame
* Re: [patch] Revert "memcg: add memory.vmscan_stat"

From: KAMEZAWA Hiroyuki @ 2011-08-30  7:35 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: Johannes Weiner, Andrew Morton, Daisuke Nishimura, Balbir Singh, Andrew Brestic, Ying Han, Michal Hocko, linux-mm, linux-kernel

On Tue, 30 Aug 2011 16:20:50 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> Hmm...removing hierarchy part completely seems fine to me.

Another idea here.
==
Revert hierarchy support in vmscan_stat. It turns out that further
study of the use cases is required.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 Documentation/cgroups/memory.txt |   27 ++-------------------------
 include/linux/memcontrol.h       |    1 -
 mm/memcontrol.c                  |   25 -------------------------
 3 files changed, 2 insertions(+), 51 deletions(-)

Index: mmotm-Aug29/Documentation/cgroups/memory.txt
===================================================================
--- mmotm-Aug29.orig/Documentation/cgroups/memory.txt
+++ mmotm-Aug29/Documentation/cgroups/memory.txt
@@ -448,8 +448,8 @@ memory cgroup creation and can be reset
 This file contains following statistics.

-[param]_[file_or_anon]_pages_by_[reason]_[under_heararchy]
-[param]_elapsed_ns_by_[reason]_[under_hierarchy]
+[param]_[file_or_anon]_pages_by_[reason]
+[param]_elapsed_ns_by_[reason]

 For example,

@@ -470,9 +470,6 @@ Now, 2 reason are supported
 system - global memory pressure + softlimit
 (global memory pressure not under softlimit is not handled now)

-When under_hierarchy is added in the tail, the number indicates the
-total memcg scan of its children and itself.
-
 elapsed_ns is a elapsed time in nanosecond. This may include sleep time
 and not indicates CPU usage. So, please take this as just showing
 latency.

@@ -500,26 +497,6 @@ freed_pages_by_system 0
 freed_anon_pages_by_system 0
 freed_file_pages_by_system 0
 elapsed_ns_by_system 0
-scanned_pages_by_limit_under_hierarchy 9471864
-scanned_anon_pages_by_limit_under_hierarchy 6640629
-scanned_file_pages_by_limit_under_hierarchy 2831235
-rotated_pages_by_limit_under_hierarchy 4243974
-rotated_anon_pages_by_limit_under_hierarchy 3971968
-rotated_file_pages_by_limit_under_hierarchy 272006
-freed_pages_by_limit_under_hierarchy 2318492
-freed_anon_pages_by_limit_under_hierarchy 962052
-freed_file_pages_by_limit_under_hierarchy 1356440
-elapsed_ns_by_limit_under_hierarchy 351386416101
-scanned_pages_by_system_under_hierarchy 0
-scanned_anon_pages_by_system_under_hierarchy 0
-scanned_file_pages_by_system_under_hierarchy 0
-rotated_pages_by_system_under_hierarchy 0
-rotated_anon_pages_by_system_under_hierarchy 0
-rotated_file_pages_by_system_under_hierarchy 0
-freed_pages_by_system_under_hierarchy 0
-freed_anon_pages_by_system_under_hierarchy 0
-freed_file_pages_by_system_under_hierarchy 0
-elapsed_ns_by_system_under_hierarchy 0

 5.3 swappiness

Index: mmotm-Aug29/mm/memcontrol.c
===================================================================
--- mmotm-Aug29.orig/mm/memcontrol.c
+++ mmotm-Aug29/mm/memcontrol.c
@@ -229,7 +229,6 @@ enum {
 struct scanstat {
 	spinlock_t	lock;
 	unsigned long	stats[NR_SCAN_CONTEXT][NR_SCANSTATS];
-	unsigned long	rootstats[NR_SCAN_CONTEXT][NR_SCANSTATS];
 };

 const char *scanstat_string[NR_SCANSTATS] = {
@@ -246,7 +245,6 @@ const char *scanstat_string[NR_SCANSTATS
 };
 #define SCANSTAT_WORD_LIMIT	"_by_limit"
 #define SCANSTAT_WORD_SYSTEM	"_by_system"
-#define SCANSTAT_WORD_HIERARCHY	"_under_hierarchy"

 /*
@@ -1710,11 +1708,6 @@ static void mem_cgroup_record_scanstat(s
 	spin_lock(&memcg->scanstat.lock);
 	__mem_cgroup_record_scanstat(memcg->scanstat.stats[context], rec);
 	spin_unlock(&memcg->scanstat.lock);
-
-	memcg = rec->root;
-	spin_lock(&memcg->scanstat.lock);
-	__mem_cgroup_record_scanstat(memcg->scanstat.rootstats[context], rec);
-	spin_unlock(&memcg->scanstat.lock);
 }

 /*
@@ -1758,8 +1751,6 @@ static int mem_cgroup_hierarchical_recla
 	else
 		rec.context = SCAN_BY_LIMIT;

-	rec.root = root_memcg;
-
 	while (1) {
 		victim = mem_cgroup_select_victim(root_memcg);
 		if (victim == root_memcg) {
@@ -4728,20 +4719,6 @@ static int mem_cgroup_vmscan_stat_read(s
 		cb->fill(cb,
 			string, memcg->scanstat.stats[SCAN_BY_SYSTEM][i]);
 	}
-	for (i = 0; i < NR_SCANSTATS; i++) {
-		strcpy(string, scanstat_string[i]);
-		strcat(string, SCANSTAT_WORD_LIMIT);
-		strcat(string, SCANSTAT_WORD_HIERARCHY);
-		cb->fill(cb,
-			string, memcg->scanstat.rootstats[SCAN_BY_LIMIT][i]);
-	}
-	for (i = 0; i < NR_SCANSTATS; i++) {
-		strcpy(string, scanstat_string[i]);
-		strcat(string, SCANSTAT_WORD_SYSTEM);
-		strcat(string, SCANSTAT_WORD_HIERARCHY);
-		cb->fill(cb,
-			string, memcg->scanstat.rootstats[SCAN_BY_SYSTEM][i]);
-	}
 	return 0;
 }

@@ -4752,8 +4729,6 @@ static int mem_cgroup_reset_vmscan_stat(
 	spin_lock(&memcg->scanstat.lock);
 	memset(&memcg->scanstat.stats, 0, sizeof(memcg->scanstat.stats));
-	memset(&memcg->scanstat.rootstats,
-		0, sizeof(memcg->scanstat.rootstats));
 	spin_unlock(&memcg->scanstat.lock);
 	return 0;
 }
Index: mmotm-Aug29/include/linux/memcontrol.h
===================================================================
--- mmotm-Aug29.orig/include/linux/memcontrol.h
+++ mmotm-Aug29/include/linux/memcontrol.h
@@ -42,7 +42,6 @@ extern unsigned long mem_cgroup_isolate_
 struct memcg_scanrecord {
 	struct mem_cgroup *mem;  /* scanend memory cgroup */
-	struct mem_cgroup *root; /* scan target hierarchy root */
 	int context;             /* scanning context (see memcontrol.c) */
 	unsigned long nr_scanned[2]; /* the number of scanned pages */
 	unsigned long nr_rotated[2]; /* the number of rotated pages */
* Re: [patch] Revert "memcg: add memory.vmscan_stat"

From: Johannes Weiner @ 2011-08-30  8:42 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: Andrew Morton, Daisuke Nishimura, Balbir Singh, Andrew Brestic, Ying Han, Michal Hocko, linux-mm, linux-kernel

On Tue, Aug 30, 2011 at 04:20:50PM +0900, KAMEZAWA Hiroyuki wrote:
> On Tue, 30 Aug 2011 09:04:24 +0200
> Johannes Weiner <jweiner@redhat.com> wrote:
> 
> > Okay, so this looks correct, but it sums up all parents after each
> > memcg scanned, which could have a performance impact.  Usually,
> > hierarchy statistics are only summed up when a user reads them.
> 
> Hmm. But sum-at-read doesn't work.
> 
> Assume 3 cgroups in a hierarchy.
> 
>     A
>    /
>   B
>  /
> C
> 
> C's scan contains 3 causes:
>   C's scan caused by the limit of A,
>   C's scan caused by the limit of B,
>   C's scan caused by the limit of C.
> 
> If we make the hierarchy sum at read, we think
> 
>   B's scan_stat = B's scan_stat + C's scan_stat
> 
> But in precise terms, this is
> 
>   B's scan_stat = B's scan_stat caused by B +
>                   B's scan_stat caused by A +
>                   C's scan_stat caused by C +
>                   C's scan_stat caused by B +
>                   C's scan_stat caused by A.
> 
> In the original version:
> 
>   B's scan_stat = B's scan_stat caused by B +
>                   C's scan_stat caused by B
> 
> Hmm...removing hierarchy part completely seems fine to me.

I see.  You want to look at A and see whether its limit was
responsible for reclaim scans in any children.

IMO, that is asking the question backwards.  Instead, there is a
cgroup under reclaim and one wants to find out the cause for that.
Not the other way round.

In my original proposal I suggested differentiating reclaim caused by
internal pressure (due to own limit) and reclaim caused by
external/hierarchical pressure (due to limits from parents).

If you want to find out why C is under reclaim, look at its reclaim
statistics.  If the _limit numbers are high, C's limit is the problem.
If the _hierarchical numbers are high, the problem is B, A, or
physical memory, so you check B for _limit and _hierarchical as well,
then move on to A.

Implementing this would be as easy as passing not only the memcg to
scan (victim) to the reclaim code, but also the memcg /causing/ the
reclaim (root_mem):

	root_mem == victim -> account to victim as _limit
	root_mem != victim -> account to victim as _hierarchical

This would make things much simpler and more natural, both the code
and the way of tracking down a problem, IMO.

> > I don't get why this has to be done completely different from the way
> > we usually do things, without any justification, whatsoever.
> > 
> > Why do you want to pass a recording structure down the reclaim stack?
> 
> Just for reducing number of passed variables.

It's still sitting at the bottom of the reclaim stack the whole time.
With my proposal, you would only need to pass the extra root_mem
pointer.
* Re: [patch] Revert "memcg: add memory.vmscan_stat"
From: KAMEZAWA Hiroyuki @ 2011-08-30  8:56 UTC
To: Johannes Weiner
Cc: Andrew Morton, Daisuke Nishimura, Balbir Singh, Andrew Brestic,
    Ying Han, Michal Hocko, linux-mm, linux-kernel

On Tue, 30 Aug 2011 10:42:45 +0200
Johannes Weiner <jweiner@redhat.com> wrote:

> On Tue, Aug 30, 2011 at 04:20:50PM +0900, KAMEZAWA Hiroyuki wrote:
> > On Tue, 30 Aug 2011 09:04:24 +0200
> > Johannes Weiner <jweiner@redhat.com> wrote:
> >
> > > On Tue, Aug 30, 2011 at 10:12:33AM +0900, KAMEZAWA Hiroyuki wrote:
> > > > @@ -1710,11 +1711,18 @@ static void mem_cgroup_record_scanstat(s
> > > >  	spin_lock(&memcg->scanstat.lock);
> > > >  	__mem_cgroup_record_scanstat(memcg->scanstat.stats[context], rec);
> > > >  	spin_unlock(&memcg->scanstat.lock);
> > > > -
> > > > -	memcg = rec->root;
> > > > -	spin_lock(&memcg->scanstat.lock);
> > > > -	__mem_cgroup_record_scanstat(memcg->scanstat.rootstats[context], rec);
> > > > -	spin_unlock(&memcg->scanstat.lock);
> > > > +	cgroup = memcg->css.cgroup;
> > > > +	do {
> > > > +		spin_lock(&memcg->scanstat.lock);
> > > > +		__mem_cgroup_record_scanstat(
> > > > +			memcg->scanstat.hierarchy_stats[context], rec);
> > > > +		spin_unlock(&memcg->scanstat.lock);
> > > > +		if (!cgroup->parent)
> > > > +			break;
> > > > +		cgroup = cgroup->parent;
> > > > +		memcg = mem_cgroup_from_cont(cgroup);
> > > > +	} while (memcg->use_hierarchy && memcg != rec->root);
> > >
> > > Okay, so this looks correct, but it sums up all parents after each
> > > memcg scanned, which could have a performance impact.  Usually,
> > > hierarchy statistics are only summed up when a user reads them.
> >
> > Hmm. But sum-at-read doesn't work.
> >
> > Assume 3 cgroups in a hierarchy.
> >
> >       A
> >      /
> >     B
> >    /
> >   C
> >
> > C's scan contains 3 causes:
> >   C's scan caused by the limit of A,
> >   C's scan caused by the limit of B,
> >   C's scan caused by the limit of C.
> >
> > If we make the hierarchy sum at read, we think
> >   B's scan_stat = B's scan_stat + C's scan_stat
> > but, to be precise, this is
> >
> >   B's scan_stat = B's scan_stat caused by B +
> >                   B's scan_stat caused by A +
> >                   C's scan_stat caused by C +
> >                   C's scan_stat caused by B +
> >                   C's scan_stat caused by A.
> >
> > In the original version:
> >   B's scan_stat = B's scan_stat caused by B +
> >                   C's scan_stat caused by B
> >
> > After this patch:
> >   B's scan_stat = B's scan_stat caused by B +
> >                   B's scan_stat caused by A +
> >                   C's scan_stat caused by C +
> >                   C's scan_stat caused by B +
> >                   C's scan_stat caused by A.
> >
> > Hmm... removing the hierarchy part completely seems fine to me.
>
> I see.
>
> You want to look at A and see whether its limit was responsible for
> reclaim scans in any children.  IMO, that is asking the question
> backwards.  Instead, there is a cgroup under reclaim and one wants to
> find out the cause for that.  Not the other way round.
>
> In my original proposal I suggested differentiating reclaim caused by
> internal pressure (due to the cgroup's own limit) and reclaim caused
> by external/hierarchical pressure (due to limits from parents).
>
> If you want to find out why C is under reclaim, look at its reclaim
> statistics.  If the _limit numbers are high, C's limit is the problem.
> If the _hierarchical numbers are high, the problem is B, A, or
> physical memory, so you check B for _limit and _hierarchical as well,
> then move on to A.
>
> Implementing this would be as easy as passing not only the memcg to
> scan (victim) to the reclaim code, but also the memcg /causing/ the
> reclaim (root_mem):
>
> 	root_mem == victim -> account to victim as _limit
> 	root_mem != victim -> account to victim as _hierarchical
>
> This would make things much simpler and more natural, both the code
> and the way of tracking down a problem, IMO.

hmm. I have no strong opinion.

> > > I don't get why this has to be done completely different from the
> > > way we usually do things, without any justification, whatsoever.
> > >
> > > Why do you want to pass a recording structure down the reclaim
> > > stack?
> >
> > Just for reducing the number of passed variables.
>
> It's still sitting on the bottom of the reclaim stack the whole time.
>
> With my proposal, you would only need to pass the extra root_mem
> pointer.

I'm sorry, I'm missing something. Are you saying we should add a
function like

  mem_cgroup_record_reclaim_stat(memcg, root_mem, anon_scan, anon_free,
                                 anon_rotate, file_scan, file_free,
                                 elapsed_ns)

?

I'll prepare a patch tomorrow.

Thanks,
-Kame
* Re: [patch] Revert "memcg: add memory.vmscan_stat"
From: Johannes Weiner @ 2011-08-30 10:17 UTC
To: KAMEZAWA Hiroyuki
Cc: Andrew Morton, Daisuke Nishimura, Balbir Singh, Andrew Brestic,
    Ying Han, Michal Hocko, linux-mm, linux-kernel

On Tue, Aug 30, 2011 at 05:56:09PM +0900, KAMEZAWA Hiroyuki wrote:
> On Tue, 30 Aug 2011 10:42:45 +0200
> Johannes Weiner <jweiner@redhat.com> wrote:
>
> > On Tue, Aug 30, 2011 at 04:20:50PM +0900, KAMEZAWA Hiroyuki wrote:
> > > On Tue, 30 Aug 2011 09:04:24 +0200
> > > Johannes Weiner <jweiner@redhat.com> wrote:
> > >
> > > > On Tue, Aug 30, 2011 at 10:12:33AM +0900, KAMEZAWA Hiroyuki wrote:
> > > > > @@ -1710,11 +1711,18 @@ static void mem_cgroup_record_scanstat(s
> > > > >  	spin_lock(&memcg->scanstat.lock);
> > > > >  	__mem_cgroup_record_scanstat(memcg->scanstat.stats[context], rec);
> > > > >  	spin_unlock(&memcg->scanstat.lock);
> > > > > -
> > > > > -	memcg = rec->root;
> > > > > -	spin_lock(&memcg->scanstat.lock);
> > > > > -	__mem_cgroup_record_scanstat(memcg->scanstat.rootstats[context], rec);
> > > > > -	spin_unlock(&memcg->scanstat.lock);
> > > > > +	cgroup = memcg->css.cgroup;
> > > > > +	do {
> > > > > +		spin_lock(&memcg->scanstat.lock);
> > > > > +		__mem_cgroup_record_scanstat(
> > > > > +			memcg->scanstat.hierarchy_stats[context], rec);
> > > > > +		spin_unlock(&memcg->scanstat.lock);
> > > > > +		if (!cgroup->parent)
> > > > > +			break;
> > > > > +		cgroup = cgroup->parent;
> > > > > +		memcg = mem_cgroup_from_cont(cgroup);
> > > > > +	} while (memcg->use_hierarchy && memcg != rec->root);
> > > >
> > > > Okay, so this looks correct, but it sums up all parents after each
> > > > memcg scanned, which could have a performance impact.  Usually,
> > > > hierarchy statistics are only summed up when a user reads them.
> > >
> > > Hmm. But sum-at-read doesn't work.
> > >
> > > Assume 3 cgroups in a hierarchy.
> > >
> > >       A
> > >      /
> > >     B
> > >    /
> > >   C
> > >
> > > C's scan contains 3 causes:
> > >   C's scan caused by the limit of A,
> > >   C's scan caused by the limit of B,
> > >   C's scan caused by the limit of C.
> > >
> > > If we make the hierarchy sum at read, we think
> > >   B's scan_stat = B's scan_stat + C's scan_stat
> > > but, to be precise, this is
> > >
> > >   B's scan_stat = B's scan_stat caused by B +
> > >                   B's scan_stat caused by A +
> > >                   C's scan_stat caused by C +
> > >                   C's scan_stat caused by B +
> > >                   C's scan_stat caused by A.
> > >
> > > In the original version:
> > >   B's scan_stat = B's scan_stat caused by B +
> > >                   C's scan_stat caused by B
> > >
> > > After this patch:
> > >   B's scan_stat = B's scan_stat caused by B +
> > >                   B's scan_stat caused by A +
> > >                   C's scan_stat caused by C +
> > >                   C's scan_stat caused by B +
> > >                   C's scan_stat caused by A.
> > >
> > > Hmm... removing the hierarchy part completely seems fine to me.
> >
> > I see.
> >
> > You want to look at A and see whether its limit was responsible for
> > reclaim scans in any children.  IMO, that is asking the question
> > backwards.  Instead, there is a cgroup under reclaim and one wants to
> > find out the cause for that.  Not the other way round.
> >
> > In my original proposal I suggested differentiating reclaim caused by
> > internal pressure (due to the cgroup's own limit) and reclaim caused
> > by external/hierarchical pressure (due to limits from parents).
> >
> > If you want to find out why C is under reclaim, look at its reclaim
> > statistics.  If the _limit numbers are high, C's limit is the problem.
> > If the _hierarchical numbers are high, the problem is B, A, or
> > physical memory, so you check B for _limit and _hierarchical as well,
> > then move on to A.
> >
> > Implementing this would be as easy as passing not only the memcg to
> > scan (victim) to the reclaim code, but also the memcg /causing/ the
> > reclaim (root_mem):
> >
> > 	root_mem == victim -> account to victim as _limit
> > 	root_mem != victim -> account to victim as _hierarchical
> >
> > This would make things much simpler and more natural, both the code
> > and the way of tracking down a problem, IMO.
>
> hmm. I have no strong opinion.

I do :-)

> > > > I don't get why this has to be done completely different from
> > > > the way we usually do things, without any justification,
> > > > whatsoever.
> > > >
> > > > Why do you want to pass a recording structure down the reclaim
> > > > stack?
> > >
> > > Just for reducing the number of passed variables.
> >
> > It's still sitting on the bottom of the reclaim stack the whole time.
> >
> > With my proposal, you would only need to pass the extra root_mem
> > pointer.
>
> I'm sorry, I'm missing something. Are you saying we should add a
> function like
>
>   mem_cgroup_record_reclaim_stat(memcg, root_mem, anon_scan, anon_free,
>                                  anon_rotate, file_scan, file_free,
>                                  elapsed_ns)
>
> ?

Exactly, though passing it a stat item index and a delta would
probably be closer to our other statistics accounting, i.e.:

	mem_cgroup_record_reclaim_stat(sc->mem_cgroup, sc->root_mem_cgroup,
				       MEM_CGROUP_SCAN_ANON, *nr_anon);

where sc->mem_cgroup is `victim' and sc->root_mem_cgroup is `root_mem'
from hierarchical_reclaim.  ->root_mem_cgroup might be confusing,
though.  I named it ->target_mem_cgroup in my patch set but I don't
feel too strongly about that.

Even better would be to reuse enum vm_event_item and at one point
merge all the accounting stuff into a single function, so we have one
single set of events that makes sense on a global level as well as on
a per-memcg level.

There is deviation: implementing similar things twice with slight
variations, and I don't see any justification for all that extra code
that needs maintaining.

Or counters that have similar names globally and on a per-memcg level
but with different meanings, like the rotated counter.  Globally, a
rotated page (PGROTATED) is one that is moved back to the inactive
list after writeback finishes.  Per-memcg, the rotated counter is our
internal heuristics value to balance pressure between LRUs and means
either rotated on the inactive list, activated, or not activated but
counted as activated because of VM_EXEC etc.

I am still for reverting this patch before the release until we have
this all sorted out.  I feel rather strongly that these statistics are
in no way ready to make them part of the ABI and export them to
userspace as they are now.
* Re: [patch] Revert "memcg: add memory.vmscan_stat"
From: KAMEZAWA Hiroyuki @ 2011-08-30 10:34 UTC
To: Johannes Weiner
Cc: Andrew Morton, Daisuke Nishimura, Balbir Singh, Andrew Brestic,
    Ying Han, Michal Hocko, linux-mm, linux-kernel

On Tue, 30 Aug 2011 12:17:26 +0200
Johannes Weiner <jweiner@redhat.com> wrote:

> On Tue, Aug 30, 2011 at 05:56:09PM +0900, KAMEZAWA Hiroyuki wrote:
> > > > > I don't get why this has to be done completely different from
> > > > > the way we usually do things, without any justification,
> > > > > whatsoever.
> > > > >
> > > > > Why do you want to pass a recording structure down the
> > > > > reclaim stack?
> > > >
> > > > Just for reducing the number of passed variables.
> > >
> > > It's still sitting on the bottom of the reclaim stack the whole
> > > time.
> > >
> > > With my proposal, you would only need to pass the extra root_mem
> > > pointer.
> >
> > I'm sorry, I'm missing something. Are you saying we should add a
> > function like
> >
> >   mem_cgroup_record_reclaim_stat(memcg, root_mem, anon_scan,
> >                                  anon_free, anon_rotate, file_scan,
> >                                  file_free, elapsed_ns)
> >
> > ?
>
> Exactly, though passing it a stat item index and a delta would
> probably be closer to our other statistics accounting, i.e.:
>
> 	mem_cgroup_record_reclaim_stat(sc->mem_cgroup, sc->root_mem_cgroup,
> 				       MEM_CGROUP_SCAN_ANON, *nr_anon);
>
> where sc->mem_cgroup is `victim' and sc->root_mem_cgroup is `root_mem'
> from hierarchical_reclaim.  ->root_mem_cgroup might be confusing,
> though.  I named it ->target_mem_cgroup in my patch set but I don't
> feel too strongly about that.
>
> Even better would be to reuse enum vm_event_item and at one point
> merge all the accounting stuff into a single function, so we have one
> single set of events that makes sense on a global level as well as on
> a per-memcg level.
>
> There is deviation: implementing similar things twice with slight
> variations, and I don't see any justification for all that extra code
> that needs maintaining.  Or counters that have similar names globally
> and on a per-memcg level but with different meanings, like the rotated
> counter.  Globally, a rotated page (PGROTATED) is one that is moved
> back to the inactive list after writeback finishes.  Per-memcg, the
> rotated counter is our internal heuristics value to balance pressure
> between LRUs and means either rotated on the inactive list, activated,
> or not activated but counted as activated because of VM_EXEC etc.
>
> I am still for reverting this patch before the release until we have
> this all sorted out.  I feel rather strongly that these statistics are
> in no way ready to make them part of the ABI and export them to
> userspace as they are now.

How about fixing the interface first?  The 1st version of this patch
was posted in April and there has been no big change since then.
I don't want to be starved more.

Thanks,
-Kame
* Re: [patch] Revert "memcg: add memory.vmscan_stat" 2011-08-30 10:34 ` KAMEZAWA Hiroyuki @ 2011-08-30 11:03 ` Johannes Weiner -1 siblings, 0 replies; 54+ messages in thread From: Johannes Weiner @ 2011-08-30 11:03 UTC (permalink / raw) To: KAMEZAWA Hiroyuki Cc: Andrew Morton, Daisuke Nishimura, Balbir Singh, Andrew Brestic, Ying Han, Michal Hocko, linux-mm, linux-kernel On Tue, Aug 30, 2011 at 07:34:06PM +0900, KAMEZAWA Hiroyuki wrote: > On Tue, 30 Aug 2011 12:17:26 +0200 > Johannes Weiner <jweiner@redhat.com> wrote: > > > On Tue, Aug 30, 2011 at 05:56:09PM +0900, KAMEZAWA Hiroyuki wrote: > > > > > > > I don't get why this has to be done completely different from the way > > > > > > we usually do things, without any justification, whatsoever. > > > > > > > > > > > > Why do you want to pass a recording structure down the reclaim stack? > > > > > > > > > > Just for reducing number of passed variables. > > > > > > > > It's still sitting on bottom of the reclaim stack the whole time. > > > > > > > > With my proposal, you would only need to pass the extra root_mem > > > > pointer. > > > > > > I'm sorry I miss something. Do you say to add a function like > > > > > > mem_cgroup_record_reclaim_stat(memcg, root_mem, anon_scan, anon_free, anon_rotate, > > > file_scan, file_free, elapsed_ns) > > > > > > ? > > > > Exactly, though passing it a stat item index and a delta would > > probably be closer to our other statistics accounting, i.e.: > > > > mem_cgroup_record_reclaim_stat(sc->mem_cgroup, sc->root_mem_cgroup, > > MEM_CGROUP_SCAN_ANON, *nr_anon); > > > > where sc->mem_cgroup is `victim' and sc->root_mem_cgroup is `root_mem' > > from hierarchical_reclaim. ->root_mem_cgroup might be confusing, > > though. I named it ->target_mem_cgroup in my patch set but I don't > > feel too strongly about that. 
> > > > Even better would be to reuse enum vm_event_item and at one point > > merge all the accounting stuff into a single function and have one > > single set of events that makes sense on a global level as well as on > > a per-memcg level. > > > > There is deviation and implementing similar things twice with slight > > variations and I don't see any justification for all that extra code > > that needs maintaining. Or counters that have similar names globally > > and on a per-memcg level but with different meanings, like the rotated > > counter. Globally, a rotated page (PGROTATED) is one that is moved > > back to the inactive list after writeback finishes. Per-memcg, the > > rotated counter is our internal heuristics value to balance pressure > > between LRUs and means either rotated on the inactive list, activated, > > not activated but countes as activated because of VM_EXEC etc. > > > > I am still for reverting this patch before the release until we have > > this all sorted out. I feel rather strongly that these statistics are > > in no way ready to make them part of the ABI and export them to > > userspace as they are now. > > How about fixing interface first ? 1st version of this patch was > in April and no big change since then. > I don't want to be starved more. Back then I mentioned all my concerns and alternate suggestions. Different from you, I explained and provided a reason for every single counter I wanted to add, suggested a basic pattern for how to interpret them to gain insight into memcg configurations and their behaviour. No reaction. If you want to make progress, than don't ignore concerns and arguments. If my arguments are crap, then tell me why and we can move on. What we have now is not ready. It wasn't discussed properly, which certainly wasn't for the lack of interest in this change. I just got tired of raising the same points over and over again without answer. 
The amount of time a change has been around is not an argument for it to get merged. On the other hand, the fact that it hasn't changed since April *even though* the implementation was opposed back then doesn't really speak for your way of getting this upstream, does it?
* Re: [patch] Revert "memcg: add memory.vmscan_stat" 2011-08-30 11:03 ` Johannes Weiner @ 2011-08-30 23:38 ` KAMEZAWA Hiroyuki 0 siblings, 0 replies; 54+ messages in thread From: KAMEZAWA Hiroyuki @ 2011-08-30 23:38 UTC (permalink / raw) To: Johannes Weiner Cc: Andrew Morton, Daisuke Nishimura, Balbir Singh, Andrew Brestic, Ying Han, Michal Hocko, linux-mm, linux-kernel On Tue, 30 Aug 2011 13:03:37 +0200 Johannes Weiner <jweiner@redhat.com> wrote: > On Tue, Aug 30, 2011 at 07:34:06PM +0900, KAMEZAWA Hiroyuki wrote: > > On Tue, 30 Aug 2011 12:17:26 +0200 > > Johannes Weiner <jweiner@redhat.com> wrote: > > How about fixing interface first ? 1st version of this patch was > > in April and no big change since then. > > I don't want to be starved more. > > Back then I mentioned all my concerns and alternate suggestions. > Different from you, I explained and provided a reason for every single > counter I wanted to add, suggested a basic pattern for how to > interpret them to gain insight into memcg configurations and their > behaviour. No reaction. If you want to make progress, then don't > ignore concerns and arguments. If my arguments are crap, then tell me > why and we can move on. > I think having a percpu counter has no performance benefit; it just loses extra memory to percpu allocation. Anyway, you can change the internal implementation when it's necessary. But OK, I agree using the same style as zone counters is better. > What we have now is not ready. It wasn't discussed properly, which > certainly wasn't for the lack of interest in this change. I just got > tired of raising the same points over and over again without answer. > > The amount of time a change has been around is not an argument for it > to get merged. On the other hand, the fact that it hasn't changed > since April *even though* the implementation was opposed back then > doesn't really speak for your way of getting this upstream, does it? 
The fact is that the patch is already merged to mmotm, so you should revert it. Please revert the patch and merge your own. Anyway, I don't have much interest in hierarchy. Bye, -Kame
* Re: [patch] Revert "memcg: add memory.vmscan_stat" 2011-08-30 10:17 ` Johannes Weiner @ 2011-08-30 10:38 ` KAMEZAWA Hiroyuki -1 siblings, 0 replies; 54+ messages in thread From: KAMEZAWA Hiroyuki @ 2011-08-30 10:38 UTC (permalink / raw) To: Johannes Weiner Cc: Andrew Morton, Daisuke Nishimura, Balbir Singh, Andrew Brestic, Ying Han, Michal Hocko, linux-mm, linux-kernel On Tue, 30 Aug 2011 12:17:26 +0200 Johannes Weiner <jweiner@redhat.com> wrote: > On Tue, Aug 30, 2011 at 05:56:09PM +0900, KAMEZAWA Hiroyuki wrote: > > On Tue, 30 Aug 2011 10:42:45 +0200 > > Johannes Weiner <jweiner@redhat.com> wrote: > > > > Assume 3 cgroups in a hierarchy. > > > > > > > > A > > > > / > > > > B > > > > / > > > > C > > > > > > > > C's scan contains 3 causes. > > > > C's scan caused by limit of A. > > > > C's scan caused by limit of B. > > > > C's scan caused by limit of C. > > > > > > > > If we make hierarchy sum at read, we think > > > > B's scan_stat = B's scan_stat + C's scan_stat > > > > But in precice, this is > > > > > > > > B's scan_stat = B's scan_stat caused by B + > > > > B's scan_stat caused by A + > > > > C's scan_stat caused by C + > > > > C's scan_stat caused by B + > > > > C's scan_stat caused by A. > > > > > > > > In orignal version. > > > > B's scan_stat = B's scan_stat caused by B + > > > > C's scan_stat caused by B + > > > > > > > > After this patch, > > > > B's scan_stat = B's scan_stat caused by B + > > > > B's scan_stat caused by A + > > > > C's scan_stat caused by C + > > > > C's scan_stat caused by B + > > > > C's scan_stat caused by A. > > > > > > > > Hmm...removing hierarchy part completely seems fine to me. > > > > > > I see. > > > > > > You want to look at A and see whether its limit was responsible for > > > reclaim scans in any children. IMO, that is asking the question > > > backwards. Instead, there is a cgroup under reclaim and one wants to > > > find out the cause for that. Not the other way round. 
> > > > > > In my original proposal I suggested differentiating reclaim caused by > > > internal pressure (due to own limit) and reclaim caused by > > > external/hierarchical pressure (due to limits from parents). > > > > > > If you want to find out why C is under reclaim, look at its reclaim > > > statistics. If the _limit numbers are high, C's limit is the problem. > > > If the _hierarchical numbers are high, the problem is B, A, or > > > physical memory, so you check B for _limit and _hierarchical as well, > > > then move on to A. > > > > > > Implementing this would be as easy as passing not only the memcg to > > > scan (victim) to the reclaim code, but also the memcg /causing/ the > > > reclaim (root_mem): > > > > > > root_mem == victim -> account to victim as _limit > > > root_mem != victim -> account to victim as _hierarchical > > > > > > This would make things much simpler and more natural, both the code > > > and the way of tracking down a problem, IMO. > > > > hmm. I have no strong opinion. > > I do :-) > BTW, how to calculate C's lru scan caused by A finally ? A / B / C At scanning LRU of C because of A's limit, where stats are recorded ? If we record it in C, we lose where the memory pressure comes from. If we record it in A, we lose where scan happens. I'm sorry I'm a little confused. Thanks, -Kame
* Re: [patch] Revert "memcg: add memory.vmscan_stat" 2011-08-30 10:38 ` KAMEZAWA Hiroyuki @ 2011-08-30 11:32 ` Johannes Weiner -1 siblings, 0 replies; 54+ messages in thread From: Johannes Weiner @ 2011-08-30 11:32 UTC (permalink / raw) To: KAMEZAWA Hiroyuki Cc: Andrew Morton, Daisuke Nishimura, Balbir Singh, Andrew Brestic, Ying Han, Michal Hocko, linux-mm, linux-kernel On Tue, Aug 30, 2011 at 07:38:39PM +0900, KAMEZAWA Hiroyuki wrote: > On Tue, 30 Aug 2011 12:17:26 +0200 > Johannes Weiner <jweiner@redhat.com> wrote: > > > On Tue, Aug 30, 2011 at 05:56:09PM +0900, KAMEZAWA Hiroyuki wrote: > > > On Tue, 30 Aug 2011 10:42:45 +0200 > > > Johannes Weiner <jweiner@redhat.com> wrote: > > > > > > Assume 3 cgroups in a hierarchy. > > > > > > > > > > A > > > > > / > > > > > B > > > > > / > > > > > C > > > > > > > > > > C's scan contains 3 causes. > > > > > C's scan caused by limit of A. > > > > > C's scan caused by limit of B. > > > > > C's scan caused by limit of C. > > > > > > > > > > If we make hierarchy sum at read, we think > > > > > B's scan_stat = B's scan_stat + C's scan_stat > > > > > But in precice, this is > > > > > > > > > > B's scan_stat = B's scan_stat caused by B + > > > > > B's scan_stat caused by A + > > > > > C's scan_stat caused by C + > > > > > C's scan_stat caused by B + > > > > > C's scan_stat caused by A. > > > > > > > > > > In orignal version. > > > > > B's scan_stat = B's scan_stat caused by B + > > > > > C's scan_stat caused by B + > > > > > > > > > > After this patch, > > > > > B's scan_stat = B's scan_stat caused by B + > > > > > B's scan_stat caused by A + > > > > > C's scan_stat caused by C + > > > > > C's scan_stat caused by B + > > > > > C's scan_stat caused by A. > > > > > > > > > > Hmm...removing hierarchy part completely seems fine to me. > > > > > > > > I see. > > > > > > > > You want to look at A and see whether its limit was responsible for > > > > reclaim scans in any children. IMO, that is asking the question > > > > backwards. 
Instead, there is a cgroup under reclaim and one wants to > > > > find out the cause for that. Not the other way round. > > > > > > > > In my original proposal I suggested differentiating reclaim caused by > > > > internal pressure (due to own limit) and reclaim caused by > > > > external/hierarchical pressure (due to limits from parents). > > > > > > > > If you want to find out why C is under reclaim, look at its reclaim > > > > statistics. If the _limit numbers are high, C's limit is the problem. > > > > If the _hierarchical numbers are high, the problem is B, A, or > > > > physical memory, so you check B for _limit and _hierarchical as well, > > > > then move on to A. > > > > > > > > Implementing this would be as easy as passing not only the memcg to > > > > scan (victim) to the reclaim code, but also the memcg /causing/ the > > > > reclaim (root_mem): > > > > > > > > root_mem == victim -> account to victim as _limit > > > > root_mem != victim -> account to victim as _hierarchical > > > > > > > > This would make things much simpler and more natural, both the code > > > > and the way of tracking down a problem, IMO. > > > > > > hmm. I have no strong opinion. > > > > I do :-) > > > BTW, how to calculate C's lru scan caused by A finally ? > > A > / > B > / > C > > At scanning LRU of C because of A's limit, where stats are recorded ? > > If we record it in C, we lose where the memory pressure comes from. It's recorded in C as 'scanned due to parent'. If you want to track down where pressure comes from, you check the outer container, B. If B is scanned due to internal pressure, you know that C's external pressure comes from B. If B is scanned due to external pressure, you know that B's and C's pressure comes from A or the physical memory limit (the outermost container, so to speak). The containers are nested. If C is scanned because of the limit in A, then this concerns B as well and B must be scanned as well as B, as C's usage is fully contained in B. 
There is not really a direct connection between C and A that is irrelevant to B, so I see no need to record in C which parent was the cause of the pressure. Just that it was /a/ parent and not itself. Then you can follow the pressure up the hierarchy tree. Answer to your original question:

    C_scan_due_to_A = C_scan_external - B_scan_internal - A_scan_external

But IMO, having this exact number is not necessary to find the reason why C is experiencing memory pressure in the first place, and I assume that this is the goal.
* Re: [patch] Revert "memcg: add memory.vmscan_stat" 2011-08-30 11:32 ` Johannes Weiner @ 2011-08-30 23:29 ` KAMEZAWA Hiroyuki -1 siblings, 0 replies; 54+ messages in thread From: KAMEZAWA Hiroyuki @ 2011-08-30 23:29 UTC (permalink / raw) To: Johannes Weiner Cc: Andrew Morton, Daisuke Nishimura, Balbir Singh, Andrew Brestic, Ying Han, Michal Hocko, linux-mm, linux-kernel On Tue, 30 Aug 2011 13:32:21 +0200 Johannes Weiner <jweiner@redhat.com> wrote: > On Tue, Aug 30, 2011 at 07:38:39PM +0900, KAMEZAWA Hiroyuki wrote: > > On Tue, 30 Aug 2011 12:17:26 +0200 > > Johannes Weiner <jweiner@redhat.com> wrote: > > > > > On Tue, Aug 30, 2011 at 05:56:09PM +0900, KAMEZAWA Hiroyuki wrote: > > > > On Tue, 30 Aug 2011 10:42:45 +0200 > > > > Johannes Weiner <jweiner@redhat.com> wrote: > > > > > > > > Assume 3 cgroups in a hierarchy. > > > > > > > > > > > > A > > > > > > / > > > > > > B > > > > > > / > > > > > > C > > > > > > > > > > > > C's scan contains 3 causes. > > > > > > C's scan caused by limit of A. > > > > > > C's scan caused by limit of B. > > > > > > C's scan caused by limit of C. > > > > > > > > > > > > If we make hierarchy sum at read, we think > > > > > > B's scan_stat = B's scan_stat + C's scan_stat > > > > > > But in precice, this is > > > > > > > > > > > > B's scan_stat = B's scan_stat caused by B + > > > > > > B's scan_stat caused by A + > > > > > > C's scan_stat caused by C + > > > > > > C's scan_stat caused by B + > > > > > > C's scan_stat caused by A. > > > > > > > > > > > > In orignal version. > > > > > > B's scan_stat = B's scan_stat caused by B + > > > > > > C's scan_stat caused by B + > > > > > > > > > > > > After this patch, > > > > > > B's scan_stat = B's scan_stat caused by B + > > > > > > B's scan_stat caused by A + > > > > > > C's scan_stat caused by C + > > > > > > C's scan_stat caused by B + > > > > > > C's scan_stat caused by A. > > > > > > > > > > > > Hmm...removing hierarchy part completely seems fine to me. > > > > > > > > > > I see. 
> > > > > > > > > > You want to look at A and see whether its limit was responsible for > > > > > reclaim scans in any children. IMO, that is asking the question > > > > > backwards. Instead, there is a cgroup under reclaim and one wants to > > > > > find out the cause for that. Not the other way round. > > > > > > > > > > In my original proposal I suggested differentiating reclaim caused by > > > > > internal pressure (due to own limit) and reclaim caused by > > > > > external/hierarchical pressure (due to limits from parents). > > > > > > > > > > If you want to find out why C is under reclaim, look at its reclaim > > > > > statistics. If the _limit numbers are high, C's limit is the problem. > > > > > If the _hierarchical numbers are high, the problem is B, A, or > > > > > physical memory, so you check B for _limit and _hierarchical as well, > > > > > then move on to A. > > > > > > > > > > Implementing this would be as easy as passing not only the memcg to > > > > > scan (victim) to the reclaim code, but also the memcg /causing/ the > > > > > reclaim (root_mem): > > > > > > > > > > root_mem == victim -> account to victim as _limit > > > > > root_mem != victim -> account to victim as _hierarchical > > > > > > > > > > This would make things much simpler and more natural, both the code > > > > > and the way of tracking down a problem, IMO. > > > > > > > > hmm. I have no strong opinion. > > > > > > I do :-) > > > > > BTW, how to calculate C's lru scan caused by A finally ? > > > > A > > / > > B > > / > > C > > > > At scanning LRU of C because of A's limit, where stats are recorded ? > > > > If we record it in C, we lose where the memory pressure comes from. > > It's recorded in C as 'scanned due to parent'. > > If you want to track down where pressure comes from, you check the > outer container, B. If B is scanned due to internal pressure, you > know that C's external pressure comes from B. 
If B is scanned due to > external pressure, you know that B's and C's pressure comes from A or > the physical memory limit (the outermost container, so to speak). > > The containers are nested. If C is scanned because of the limit in A, > then this concerns B as well and B must be scanned as well as B, as > C's usage is fully contained in B. > > There is not really a direct connection between C and A that is > irrelevant to B, so I see no need to record in C which parent was the > cause of the pressure. Just that it was /a/ parent and not itself. > Then you can follow the pressure up the hierarchy tree. > > Answer to your original question: > > C_scan_due_to_A = C_scan_external - B_scan_internal - A_scan_external > I'm confused. If vmscan is scanning in C's LRU, (memcg == root) : C_scan_internal ++ (memcg != root) : C_scan_external ++ Why A_scan_external exists ? It's 0 ? I think we can never get numbers. Thanks, -Kame
* Re: [patch] Revert "memcg: add memory.vmscan_stat" @ 2011-08-30 23:29 ` KAMEZAWA Hiroyuki 0 siblings, 0 replies; 54+ messages in thread From: KAMEZAWA Hiroyuki @ 2011-08-30 23:29 UTC (permalink / raw) To: Johannes Weiner Cc: Andrew Morton, Daisuke Nishimura, Balbir Singh, Andrew Brestic, Ying Han, Michal Hocko, linux-mm, linux-kernel On Tue, 30 Aug 2011 13:32:21 +0200 Johannes Weiner <jweiner@redhat.com> wrote: > On Tue, Aug 30, 2011 at 07:38:39PM +0900, KAMEZAWA Hiroyuki wrote: > > On Tue, 30 Aug 2011 12:17:26 +0200 > > Johannes Weiner <jweiner@redhat.com> wrote: > > > > > On Tue, Aug 30, 2011 at 05:56:09PM +0900, KAMEZAWA Hiroyuki wrote: > > > > On Tue, 30 Aug 2011 10:42:45 +0200 > > > > Johannes Weiner <jweiner@redhat.com> wrote: > > > > > > > > Assume 3 cgroups in a hierarchy. > > > > > > > > > > > > A > > > > > > / > > > > > > B > > > > > > / > > > > > > C > > > > > > > > > > > > C's scan contains 3 causes. > > > > > > C's scan caused by limit of A. > > > > > > C's scan caused by limit of B. > > > > > > C's scan caused by limit of C. > > > > > > > > > > > > If we make hierarchy sum at read, we think > > > > > > B's scan_stat = B's scan_stat + C's scan_stat > > > > > > But in precice, this is > > > > > > > > > > > > B's scan_stat = B's scan_stat caused by B + > > > > > > B's scan_stat caused by A + > > > > > > C's scan_stat caused by C + > > > > > > C's scan_stat caused by B + > > > > > > C's scan_stat caused by A. > > > > > > > > > > > > In orignal version. > > > > > > B's scan_stat = B's scan_stat caused by B + > > > > > > C's scan_stat caused by B + > > > > > > > > > > > > After this patch, > > > > > > B's scan_stat = B's scan_stat caused by B + > > > > > > B's scan_stat caused by A + > > > > > > C's scan_stat caused by C + > > > > > > C's scan_stat caused by B + > > > > > > C's scan_stat caused by A. > > > > > > > > > > > > Hmm...removing hierarchy part completely seems fine to me. > > > > > > > > > > I see. 
> > > > > > > > > > You want to look at A and see whether its limit was responsible for > > > > > reclaim scans in any children. IMO, that is asking the question > > > > > backwards. Instead, there is a cgroup under reclaim and one wants to > > > > > find out the cause for that. Not the other way round. > > > > > > > > > > In my original proposal I suggested differentiating reclaim caused by > > > > > internal pressure (due to own limit) and reclaim caused by > > > > > external/hierarchical pressure (due to limits from parents). > > > > > > > > > > If you want to find out why C is under reclaim, look at its reclaim > > > > > statistics. If the _limit numbers are high, C's limit is the problem. > > > > > If the _hierarchical numbers are high, the problem is B, A, or > > > > > physical memory, so you check B for _limit and _hierarchical as well, > > > > > then move on to A. > > > > > > > > > > Implementing this would be as easy as passing not only the memcg to > > > > > scan (victim) to the reclaim code, but also the memcg /causing/ the > > > > > reclaim (root_mem): > > > > > > > > > > root_mem == victim -> account to victim as _limit > > > > > root_mem != victim -> account to victim as _hierarchical > > > > > > > > > > This would make things much simpler and more natural, both the code > > > > > and the way of tracking down a problem, IMO. > > > > > > > > hmm. I have no strong opinion. > > > > > > I do :-) > > > > > BTW, how to calculate C's lru scan caused by A finally ? > > > > A > > / > > B > > / > > C > > > > At scanning LRU of C because of A's limit, where stats are recorded ? > > > > If we record it in C, we lose where the memory pressure comes from. > > It's recorded in C as 'scanned due to parent'. > > If you want to track down where pressure comes from, you check the > outer container, B. If B is scanned due to internal pressure, you > know that C's external pressure comes from B. 
If B is scanned due to > external pressure, you know that B's and C's pressure comes from A or > the physical memory limit (the outermost container, so to speak). > > The containers are nested. If C is scanned because of the limit in A, > then this concerns B as well and B must be scanned as well, as > C's usage is fully contained in B. > > There is not really a direct connection between C and A that is > irrelevant to B, so I see no need to record in C which parent was the > cause of the pressure. Just that it was /a/ parent and not itself. > Then you can follow the pressure up the hierarchy tree. > > Answer to your original question: > > C_scan_due_to_A = C_scan_external - B_scan_internal - A_scan_external > I'm confused. If vmscan is scanning in C's LRU, (memcg == root) : C_scan_internal ++ (memcg != root) : C_scan_external ++ Why does A_scan_external exist? Is it 0? I think we can never get the numbers. Thanks, -Kame
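The "follow the pressure up the hierarchy tree" procedure quoted above can be modeled as a small userspace C sketch. This is illustrative only, not actual kernel code: walk upward while a memcg's scans are dominated by external pressure; the first level whose pressure is internal owns the responsible limit, and reaching the root means global (physical memory) pressure.

```c
#include <stddef.h>
#include <assert.h>

/* Toy userspace model of a memcg with internal/external scan counters. */
struct memcg {
    struct memcg *parent;        /* NULL for the root */
    unsigned long scan_internal; /* scanned due to its own limit */
    unsigned long scan_external; /* scanned due to an ancestor's limit */
};

/*
 * Follow external pressure up the tree: the first ancestor whose
 * scans are mostly internal owns the limit causing the reclaim.
 * Reaching the root means the pressure is global (physical memory).
 */
static struct memcg *find_pressure_source(struct memcg *mc)
{
    while (mc->parent && mc->scan_external > mc->scan_internal)
        mc = mc->parent;
    return mc;
}
```

With the thread's A/B/C example, where A's limit drives the reclaim, the counters show internal pressure only at A, so the walk from C ends at A.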
* Re: [patch] Revert "memcg: add memory.vmscan_stat" 2011-08-30 23:29 ` KAMEZAWA Hiroyuki @ 2011-08-31 6:23 ` Johannes Weiner -1 siblings, 0 replies; 54+ messages in thread From: Johannes Weiner @ 2011-08-31 6:23 UTC (permalink / raw) To: KAMEZAWA Hiroyuki Cc: Andrew Morton, Daisuke Nishimura, Balbir Singh, Andrew Brestic, Ying Han, Michal Hocko, linux-mm, linux-kernel On Wed, Aug 31, 2011 at 08:29:24AM +0900, KAMEZAWA Hiroyuki wrote: > On Tue, 30 Aug 2011 13:32:21 +0200 > Johannes Weiner <jweiner@redhat.com> wrote: > > > On Tue, Aug 30, 2011 at 07:38:39PM +0900, KAMEZAWA Hiroyuki wrote: > > > On Tue, 30 Aug 2011 12:17:26 +0200 > > > Johannes Weiner <jweiner@redhat.com> wrote: > > > > > > > On Tue, Aug 30, 2011 at 05:56:09PM +0900, KAMEZAWA Hiroyuki wrote: > > > > > On Tue, 30 Aug 2011 10:42:45 +0200 > > > > > Johannes Weiner <jweiner@redhat.com> wrote: > > > > > > > > > > Assume 3 cgroups in a hierarchy. > > > > > > > > > > > > > > A > > > > > > > / > > > > > > > B > > > > > > > / > > > > > > > C > > > > > > > > > > > > > > C's scan contains 3 causes. > > > > > > > C's scan caused by limit of A. > > > > > > > C's scan caused by limit of B. > > > > > > > C's scan caused by limit of C. > > > > > > > > > > > > > > If we make hierarchy sum at read, we think > > > > > > > B's scan_stat = B's scan_stat + C's scan_stat > > > > > > > But in precice, this is > > > > > > > > > > > > > > B's scan_stat = B's scan_stat caused by B + > > > > > > > B's scan_stat caused by A + > > > > > > > C's scan_stat caused by C + > > > > > > > C's scan_stat caused by B + > > > > > > > C's scan_stat caused by A. > > > > > > > > > > > > > > In orignal version. 
> > > > > > > B's scan_stat = B's scan_stat caused by B + > > > > > > > C's scan_stat caused by B + > > > > > > > > > > > > > > After this patch, > > > > > > > B's scan_stat = B's scan_stat caused by B + > > > > > > > B's scan_stat caused by A + > > > > > > > C's scan_stat caused by C + > > > > > > > C's scan_stat caused by B + > > > > > > > C's scan_stat caused by A. > > > > > > > > > > > > > > Hmm...removing hierarchy part completely seems fine to me. > > > > > > > > > > > > I see. > > > > > > > > > > > > You want to look at A and see whether its limit was responsible for > > > > > > reclaim scans in any children. IMO, that is asking the question > > > > > > backwards. Instead, there is a cgroup under reclaim and one wants to > > > > > > find out the cause for that. Not the other way round. > > > > > > > > > > > > In my original proposal I suggested differentiating reclaim caused by > > > > > > internal pressure (due to own limit) and reclaim caused by > > > > > > external/hierarchical pressure (due to limits from parents). > > > > > > > > > > > > If you want to find out why C is under reclaim, look at its reclaim > > > > > > statistics. If the _limit numbers are high, C's limit is the problem. > > > > > > If the _hierarchical numbers are high, the problem is B, A, or > > > > > > physical memory, so you check B for _limit and _hierarchical as well, > > > > > > then move on to A. > > > > > > > > > > > > Implementing this would be as easy as passing not only the memcg to > > > > > > scan (victim) to the reclaim code, but also the memcg /causing/ the > > > > > > reclaim (root_mem): > > > > > > > > > > > > root_mem == victim -> account to victim as _limit > > > > > > root_mem != victim -> account to victim as _hierarchical > > > > > > > > > > > > This would make things much simpler and more natural, both the code > > > > > > and the way of tracking down a problem, IMO. > > > > > > > > > > hmm. I have no strong opinion. 
> > > > > > > > I do :-) > > > > > > > BTW, how to calculate C's lru scan caused by A finally ? > > > > > > A > > > / > > > B > > > / > > > C > > > > > > At scanning LRU of C because of A's limit, where stats are recorded ? > > > > > > If we record it in C, we lose where the memory pressure comes from. > > > > It's recorded in C as 'scanned due to parent'. > > > > If you want to track down where pressure comes from, you check the > > outer container, B. If B is scanned due to internal pressure, you > > know that C's external pressure comes from B. If B is scanned due to > > external pressure, you know that B's and C's pressure comes from A or > > the physical memory limit (the outermost container, so to speak). > > > > The containers are nested. If C is scanned because of the limit in A, > > then this concerns B as well and B must be scanned as well as B, as > > C's usage is fully contained in B. > > > > There is not really a direct connection between C and A that is > > irrelevant to B, so I see no need to record in C which parent was the > > cause of the pressure. Just that it was /a/ parent and not itself. > > Then you can follow the pressure up the hierarchy tree. > > > > Answer to your original question: > > > > C_scan_due_to_A = C_scan_external - B_scan_internal - A_scan_external > > > > I'm confused. > > If vmscan is scanning in C's LRU, > (memcg == root) : C_scan_internal ++ > (memcg != root) : C_scan_external ++ Yes. > Why A_scan_external exists ? It's 0 ? > > I think we can never get numbers. Kswapd/direct reclaim should probably be accounted as A_external, since A has no limit, so reclaim pressure can not be internal. On the other hand, one could see the amount of physical memory in the machine as A's limit and account global reclaim as A_internal. I think the former may be more natural. That aside, all memcgs should have the same statistics, obviously. Scripts can easily deal with counters being zero. 
If items differ between cgroups, that would suck a lot.
* Re: [patch] Revert "memcg: add memory.vmscan_stat" 2011-08-31 6:23 ` Johannes Weiner @ 2011-08-31 6:30 ` KAMEZAWA Hiroyuki -1 siblings, 0 replies; 54+ messages in thread From: KAMEZAWA Hiroyuki @ 2011-08-31 6:30 UTC (permalink / raw) To: Johannes Weiner Cc: Andrew Morton, Daisuke Nishimura, Balbir Singh, Andrew Brestic, Ying Han, Michal Hocko, linux-mm, linux-kernel On Wed, 31 Aug 2011 08:23:54 +0200 Johannes Weiner <jweiner@redhat.com> wrote: > On Wed, Aug 31, 2011 at 08:29:24AM +0900, KAMEZAWA Hiroyuki wrote: > > On Tue, 30 Aug 2011 13:32:21 +0200 > > Johannes Weiner <jweiner@redhat.com> wrote: > > > > > On Tue, Aug 30, 2011 at 07:38:39PM +0900, KAMEZAWA Hiroyuki wrote: > > > > On Tue, 30 Aug 2011 12:17:26 +0200 > > > > Johannes Weiner <jweiner@redhat.com> wrote: > > > > > > > > > On Tue, Aug 30, 2011 at 05:56:09PM +0900, KAMEZAWA Hiroyuki wrote: > > > > > > On Tue, 30 Aug 2011 10:42:45 +0200 > > > > > > Johannes Weiner <jweiner@redhat.com> wrote: > > > I'm confused. > > > > If vmscan is scanning in C's LRU, > > (memcg == root) : C_scan_internal ++ > > (memcg != root) : C_scan_external ++ > > Yes. > > > Why A_scan_external exists ? It's 0 ? > > > > I think we can never get numbers. > > Kswapd/direct reclaim should probably be accounted as A_external, > since A has no limit, so reclaim pressure can not be internal. > hmm, ok. All memory pressure from memcg/system other than the memcg itself is external. > On the other hand, one could see the amount of physical memory in the > machine as A's limit and account global reclaim as A_internal. > > I think the former may be more natural. > > That aside, all memcgs should have the same statistics, obviously. > Scripts can easily deal with counters being zero. If items differ > between cgroups, that would suck a lot. So, when I improve the direct-reclaim path, I need to look at the scan_internal score. What do you think about per-memcg background reclaim? Should it be counted into scan_internal?
Thanks, -Kame
* Re: [patch] Revert "memcg: add memory.vmscan_stat" 2011-08-31 6:30 ` KAMEZAWA Hiroyuki @ 2011-08-31 8:33 ` Johannes Weiner -1 siblings, 0 replies; 54+ messages in thread From: Johannes Weiner @ 2011-08-31 8:33 UTC (permalink / raw) To: KAMEZAWA Hiroyuki Cc: Andrew Morton, Daisuke Nishimura, Balbir Singh, Andrew Brestic, Ying Han, Michal Hocko, linux-mm, linux-kernel On Wed, Aug 31, 2011 at 03:30:25PM +0900, KAMEZAWA Hiroyuki wrote: > On Wed, 31 Aug 2011 08:23:54 +0200 > Johannes Weiner <jweiner@redhat.com> wrote: > > > On Wed, Aug 31, 2011 at 08:29:24AM +0900, KAMEZAWA Hiroyuki wrote: > > > On Tue, 30 Aug 2011 13:32:21 +0200 > > > Johannes Weiner <jweiner@redhat.com> wrote: > > > > > > > On Tue, Aug 30, 2011 at 07:38:39PM +0900, KAMEZAWA Hiroyuki wrote: > > > > > On Tue, 30 Aug 2011 12:17:26 +0200 > > > > > Johannes Weiner <jweiner@redhat.com> wrote: > > > > > > > > > > > On Tue, Aug 30, 2011 at 05:56:09PM +0900, KAMEZAWA Hiroyuki wrote: > > > > > > > On Tue, 30 Aug 2011 10:42:45 +0200 > > > > > > > Johannes Weiner <jweiner@redhat.com> wrote: > > > > > I'm confused. > > > > > > If vmscan is scanning in C's LRU, > > > (memcg == root) : C_scan_internal ++ > > > (memcg != root) : C_scan_external ++ > > > > Yes. > > > > > Why A_scan_external exists ? It's 0 ? > > > > > > I think we can never get numbers. > > > > Kswapd/direct reclaim should probably be accounted as A_external, > > since A has no limit, so reclaim pressure can not be internal. > > > > hmm, ok. All memory pressure from memcg/system other than the memcg itsef > is all external. > > > On the other hand, one could see the amount of physical memory in the > > machine as A's limit and account global reclaim as A_internal. > > > > I think the former may be more natural. > > > > That aside, all memcgs should have the same statistics, obviously. > > Scripts can easily deal with counters being zero. If items differ > > between cgroups, that would suck a lot. 
> > So, when I improve direct-reclaim path, I need to see score in scan_internal. Direct reclaim because of the limit or because of global pressure? I am going to assume because of the limit because global reclaim is not yet accounted to memcgs even though their pages are scanned. Please correct me if I'm wrong. A / B / C If A hits the limit and does direct reclaim in A, B, and C, then the scans in A get accounted as internal while the scans in B and C get accounted as external. > How do you think about background-reclaim-per-memcg ? > Should be counted into scan_internal ? Background reclaim is still triggered by the limit, just that the condition is 'close to limit' instead of 'reached limit'. So when per-memcg background reclaim goes off because A is close to its limit, then it will scan A (internal) and B + C (external). It's always the same code: record_reclaim_stat(culprit, victim, item, delta) In direct limit reclaim, the culprit is the one hitting its limit. In background reclaim, the culprit is the one getting close to its limit. And then again the accounting is culprit == victim -> victim_internal++ (own fault) culprit != victim -> victim_external++ (parent's fault)
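The accounting rule at the end of this message reduces to a single pointer comparison. A minimal userspace C sketch under the thread's naming (record_reclaim_stat() here is a toy model, not the kernel's actual function or signature):

```c
#include <assert.h>

struct scan_stat {
    unsigned long internal; /* scanned due to the memcg's own limit */
    unsigned long external; /* scanned due to an ancestor's limit */
};

/*
 * culprit: the memcg whose limit (or, for background reclaim, whose
 *          'close to limit' watermark) triggered the reclaim.
 * victim:  the memcg whose LRU is being scanned; stat is its record.
 */
static void record_reclaim_stat(const void *culprit, const void *victim,
                                struct scan_stat *stat, unsigned long delta)
{
    if (culprit == victim)
        stat->internal += delta; /* own fault */
    else
        stat->external += delta; /* parent's fault */
}
```

In the A/B/C example above, A hitting its limit and scanning all three memcgs would call this once per victim, crediting A as internal and B, C as external.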
* Re: [patch] Revert "memcg: add memory.vmscan_stat" 2011-08-30 8:42 ` Johannes Weiner @ 2011-09-01 6:05 ` Ying Han -1 siblings, 0 replies; 54+ messages in thread From: Ying Han @ 2011-09-01 6:05 UTC (permalink / raw) To: Johannes Weiner Cc: KAMEZAWA Hiroyuki, Andrew Morton, Daisuke Nishimura, Balbir Singh, Andrew Brestic, Michal Hocko, linux-mm, linux-kernel On Tue, Aug 30, 2011 at 1:42 AM, Johannes Weiner <jweiner@redhat.com> wrote: > On Tue, Aug 30, 2011 at 04:20:50PM +0900, KAMEZAWA Hiroyuki wrote: >> On Tue, 30 Aug 2011 09:04:24 +0200 >> Johannes Weiner <jweiner@redhat.com> wrote: >> >> > On Tue, Aug 30, 2011 at 10:12:33AM +0900, KAMEZAWA Hiroyuki wrote: >> > > @@ -1710,11 +1711,18 @@ static void mem_cgroup_record_scanstat(s >> > > spin_lock(&memcg->scanstat.lock); >> > > __mem_cgroup_record_scanstat(memcg->scanstat.stats[context], rec); >> > > spin_unlock(&memcg->scanstat.lock); >> > > - >> > > - memcg = rec->root; >> > > - spin_lock(&memcg->scanstat.lock); >> > > - __mem_cgroup_record_scanstat(memcg->scanstat.rootstats[context], rec); >> > > - spin_unlock(&memcg->scanstat.lock); >> > > + cgroup = memcg->css.cgroup; >> > > + do { >> > > + spin_lock(&memcg->scanstat.lock); >> > > + __mem_cgroup_record_scanstat( >> > > + memcg->scanstat.hierarchy_stats[context], rec); >> > > + spin_unlock(&memcg->scanstat.lock); >> > > + if (!cgroup->parent) >> > > + break; >> > > + cgroup = cgroup->parent; >> > > + memcg = mem_cgroup_from_cont(cgroup); >> > > + } while (memcg->use_hierarchy && memcg != rec->root); >> > >> > Okay, so this looks correct, but it sums up all parents after each >> > memcg scanned, which could have a performance impact. Usually, >> > hierarchy statistics are only summed up when a user reads them. >> > >> Hmm. But sum-at-read doesn't work. >> >> Assume 3 cgroups in a hierarchy. >> >> A >> / >> B >> / >> C >> >> C's scan contains 3 causes. >> C's scan caused by limit of A. >> C's scan caused by limit of B. >> C's scan caused by limit of C. 
>> >> If we make hierarchy sum at read, we think >> B's scan_stat = B's scan_stat + C's scan_stat >> But in precice, this is >> >> B's scan_stat = B's scan_stat caused by B + >> B's scan_stat caused by A + >> C's scan_stat caused by C + >> C's scan_stat caused by B + >> C's scan_stat caused by A. >> >> In orignal version. >> B's scan_stat = B's scan_stat caused by B + >> C's scan_stat caused by B + >> >> After this patch, >> B's scan_stat = B's scan_stat caused by B + >> B's scan_stat caused by A + >> C's scan_stat caused by C + >> C's scan_stat caused by B + >> C's scan_stat caused by A. >> >> Hmm...removing hierarchy part completely seems fine to me. > > I see. > > You want to look at A and see whether its limit was responsible for > reclaim scans in any children. IMO, that is asking the question > backwards. Instead, there is a cgroup under reclaim and one wants to > find out the cause for that. Not the other way round. > > In my original proposal I suggested differentiating reclaim caused by > internal pressure (due to own limit) and reclaim caused by > external/hierarchical pressure (due to limits from parents). > > If you want to find out why C is under reclaim, look at its reclaim > statistics. If the _limit numbers are high, C's limit is the problem. > If the _hierarchical numbers are high, the problem is B, A, or > physical memory, so you check B for _limit and _hierarchical as well, > then move on to A. > > Implementing this would be as easy as passing not only the memcg to > scan (victim) to the reclaim code, but also the memcg /causing/ the > reclaim (root_mem): > > root_mem == victim -> account to victim as _limit > root_mem != victim -> account to victim as _hierarchical > > This would make things much simpler and more natural, both the code > and the way of tracking down a problem, IMO. This is pretty much the stats I am currently using for debugging the reclaim patches. 
For example: scanned_pages_by_system 0 scanned_pages_by_system_under_hierarchy 50989 scanned_pages_by_limit 0 scanned_pages_by_limit_under_hierarchy 0 "_system" is counted under global reclaim, and "_limit" is counted under per-memcg reclaim. "_under_hierarchy" is set if the memcg is not the one triggering pressure. So in the previous example: > A (root) > / > B > / > C For cgroup C: scanned_pages_by_system: scanned_pages_by_system_under_hierarchy: # of pages scanned under global memory pressure scanned_pages_by_limit: # of pages scanned while C hits the limit scanned_pages_by_limit_under_hierarchy: # of pages scanned while B hits the limit --Ying > >> > I don't get why this has to be done completely different from the way >> > we usually do things, without any justification, whatsoever. >> > >> > Why do you want to pass a recording structure down the reclaim stack? >> >> Just for reducing number of passed variables. > > It's still sitting on bottom of the reclaim stack the whole time. > > With my proposal, you would only need to pass the extra root_mem > pointer. >
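Ying's four counters are selected by two independent booleans: whether the reclaim is global ("_system") or limit-driven ("_limit"), and whether the scanned memcg is the one that triggered the pressure ("_under_hierarchy"). A hedged C sketch of that mapping (the enum and function names are illustrative, not from the patch):

```c
#include <assert.h>

enum scan_counter {
    SCANNED_BY_LIMIT,                  /* victim's own limit */
    SCANNED_BY_LIMIT_UNDER_HIERARCHY,  /* an ancestor's limit */
    SCANNED_BY_SYSTEM,                 /* global reclaim targeting victim */
    SCANNED_BY_SYSTEM_UNDER_HIERARCHY, /* global reclaim, scanned as child */
};

/*
 * global_reclaim:  scan triggered by system memory pressure rather
 *                  than any memcg limit.
 * under_hierarchy: the scanned memcg is not the one that triggered
 *                  the pressure.
 */
static enum scan_counter scan_counter_for(int global_reclaim,
                                          int under_hierarchy)
{
    if (global_reclaim)
        return under_hierarchy ? SCANNED_BY_SYSTEM_UNDER_HIERARCHY
                               : SCANNED_BY_SYSTEM;
    return under_hierarchy ? SCANNED_BY_LIMIT_UNDER_HIERARCHY
                           : SCANNED_BY_LIMIT;
}
```

Under this mapping, the sample output above (only scanned_pages_by_system_under_hierarchy nonzero) would correspond to global reclaim scanning a memcg that did not itself trigger the pressure.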
* Re: [patch] Revert "memcg: add memory.vmscan_stat" @ 2011-09-01 6:05 ` Ying Han 0 siblings, 0 replies; 54+ messages in thread From: Ying Han @ 2011-09-01 6:05 UTC (permalink / raw) To: Johannes Weiner Cc: KAMEZAWA Hiroyuki, Andrew Morton, Daisuke Nishimura, Balbir Singh, Andrew Brestic, Michal Hocko, linux-mm, linux-kernel On Tue, Aug 30, 2011 at 1:42 AM, Johannes Weiner <jweiner@redhat.com> wrote: > On Tue, Aug 30, 2011 at 04:20:50PM +0900, KAMEZAWA Hiroyuki wrote: >> On Tue, 30 Aug 2011 09:04:24 +0200 >> Johannes Weiner <jweiner@redhat.com> wrote: >> >> > On Tue, Aug 30, 2011 at 10:12:33AM +0900, KAMEZAWA Hiroyuki wrote: >> > > @@ -1710,11 +1711,18 @@ static void mem_cgroup_record_scanstat(s >> > > spin_lock(&memcg->scanstat.lock); >> > > __mem_cgroup_record_scanstat(memcg->scanstat.stats[context], rec); >> > > spin_unlock(&memcg->scanstat.lock); >> > > - >> > > - memcg = rec->root; >> > > - spin_lock(&memcg->scanstat.lock); >> > > - __mem_cgroup_record_scanstat(memcg->scanstat.rootstats[context], rec); >> > > - spin_unlock(&memcg->scanstat.lock); >> > > + cgroup = memcg->css.cgroup; >> > > + do { >> > > + spin_lock(&memcg->scanstat.lock); >> > > + __mem_cgroup_record_scanstat( >> > > + memcg->scanstat.hierarchy_stats[context], rec); >> > > + spin_unlock(&memcg->scanstat.lock); >> > > + if (!cgroup->parent) >> > > + break; >> > > + cgroup = cgroup->parent; >> > > + memcg = mem_cgroup_from_cont(cgroup); >> > > + } while (memcg->use_hierarchy && memcg != rec->root); >> > >> > Okay, so this looks correct, but it sums up all parents after each >> > memcg scanned, which could have a performance impact. Usually, >> > hierarchy statistics are only summed up when a user reads them. >> > >> Hmm. But sum-at-read doesn't work. >> >> Assume 3 cgroups in a hierarchy. >> >> A >> / >> B >> / >> C >> >> C's scan contains 3 causes. >> C's scan caused by limit of A. >> C's scan caused by limit of B. >> C's scan caused by limit of C. 
>>
>> If we make the hierarchy sum at read, we think
>>
>>   B's scan_stat = B's scan_stat + C's scan_stat
>>
>> But to be precise, this is
>>
>>   B's scan_stat = B's scan_stat caused by B +
>>                   B's scan_stat caused by A +
>>                   C's scan_stat caused by C +
>>                   C's scan_stat caused by B +
>>                   C's scan_stat caused by A.
>>
>> In the original version:
>>
>>   B's scan_stat = B's scan_stat caused by B +
>>                   C's scan_stat caused by B
>>
>> After this patch:
>>
>>   B's scan_stat = B's scan_stat caused by B +
>>                   B's scan_stat caused by A +
>>                   C's scan_stat caused by C +
>>                   C's scan_stat caused by B +
>>                   C's scan_stat caused by A.
>>
>> Hmm... removing the hierarchy part completely seems fine to me.
>
> I see.
>
> You want to look at A and see whether its limit was responsible for
> reclaim scans in any children. IMO, that is asking the question
> backwards. Instead, there is a cgroup under reclaim and one wants to
> find out the cause for that. Not the other way round.
>
> In my original proposal I suggested differentiating reclaim caused by
> internal pressure (due to own limit) and reclaim caused by
> external/hierarchical pressure (due to limits from parents).
>
> If you want to find out why C is under reclaim, look at its reclaim
> statistics. If the _limit numbers are high, C's limit is the problem.
> If the _hierarchical numbers are high, the problem is B, A, or
> physical memory, so you check B for _limit and _hierarchical as well,
> then move on to A.
>
> Implementing this would be as easy as passing not only the memcg to
> scan (victim) to the reclaim code, but also the memcg /causing/ the
> reclaim (root_mem):
>
>   root_mem == victim -> account to victim as _limit
>   root_mem != victim -> account to victim as _hierarchical
>
> This would make things much simpler and more natural, both the code
> and the way of tracking down a problem, IMO.

This is pretty much the stats I am currently using for debugging the reclaim patches.
For example:

scanned_pages_by_system 0
scanned_pages_by_system_under_hierarchy 50989

scanned_pages_by_limit 0
scanned_pages_by_limit_under_hierarchy 0

"_system" is counted under global reclaim, and "_limit" is counted under
per-memcg reclaim. "_under_hierarchy" is set if the memcg is not the one
triggering the pressure. So in the previous example:

> A (root)
> /
> B
> /
> C

For cgroup C:
scanned_pages_by_system:
scanned_pages_by_system_under_hierarchy: # of pages scanned under global memory pressure
scanned_pages_by_limit: # of pages scanned while C hits the limit
scanned_pages_by_limit_under_hierarchy: # of pages scanned while B hits the limit

--Ying

>
>> > I don't get why this has to be done completely different from the way
>> > we usually do things, without any justification, whatsoever.
>> >
>> > Why do you want to pass a recording structure down the reclaim stack?
>>
>> Just for reducing the number of passed variables.
>
> It's still sitting on the bottom of the reclaim stack the whole time.
>
> With my proposal, you would only need to pass the extra root_mem
> pointer.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
* Re: [patch] Revert "memcg: add memory.vmscan_stat" @ 2011-09-01 6:40 ` Johannes Weiner
From: Johannes Weiner @ 2011-09-01 6:40 UTC (permalink / raw)
To: Ying Han
Cc: KAMEZAWA Hiroyuki, Andrew Morton, Daisuke Nishimura, Balbir Singh, Andrew Brestic, Michal Hocko, linux-mm, linux-kernel

On Wed, Aug 31, 2011 at 11:05:51PM -0700, Ying Han wrote:
> On Tue, Aug 30, 2011 at 1:42 AM, Johannes Weiner <jweiner@redhat.com> wrote:
> > You want to look at A and see whether its limit was responsible for
> > reclaim scans in any children. IMO, that is asking the question
> > backwards. Instead, there is a cgroup under reclaim and one wants to
> > find out the cause for that. Not the other way round.
> >
> > In my original proposal I suggested differentiating reclaim caused by
> > internal pressure (due to own limit) and reclaim caused by
> > external/hierarchical pressure (due to limits from parents).
> >
> > If you want to find out why C is under reclaim, look at its reclaim
> > statistics. If the _limit numbers are high, C's limit is the problem.
> > If the _hierarchical numbers are high, the problem is B, A, or
> > physical memory, so you check B for _limit and _hierarchical as well,
> > then move on to A.
> >
> > Implementing this would be as easy as passing not only the memcg to
> > scan (victim) to the reclaim code, but also the memcg /causing/ the
> > reclaim (root_mem):
> >
> >   root_mem == victim -> account to victim as _limit
> >   root_mem != victim -> account to victim as _hierarchical
> >
> > This would make things much simpler and more natural, both the code
> > and the way of tracking down a problem, IMO.
>
> This is pretty much the stats I am currently using for debugging the
> reclaim patches. For example:
>
> scanned_pages_by_system 0
> scanned_pages_by_system_under_hierarchy 50989
>
> scanned_pages_by_limit 0
> scanned_pages_by_limit_under_hierarchy 0
>
> "_system" is counted under global reclaim, and "_limit" is counted under
> per-memcg reclaim.
> "_under_hierarchy" is set if the memcg is not the one triggering pressure.

I don't get this distinction between _system and _limit. How is it
orthogonal to _limit vs. _hierarchy, i.e. internal vs. external?

If the system scans memcgs, then no limit is at fault. It's just
external pressure.

For example, what is the distinction between scanned_pages_by_system
and scanned_pages_by_system_under_hierarchy? The reason for
scanned_pages_by_system would be, per your definition, neither due to
the limit (_by_system -> global reclaim) nor not due to the limit
(!_under_hierarchy -> the memcg is the one triggering pressure).
* Re: [patch] Revert "memcg: add memory.vmscan_stat" @ 2011-09-01 7:04 ` Ying Han
From: Ying Han @ 2011-09-01 7:04 UTC (permalink / raw)
To: Johannes Weiner
Cc: KAMEZAWA Hiroyuki, Andrew Morton, Daisuke Nishimura, Balbir Singh, Michal Hocko, linux-mm, linux-kernel

On Wed, Aug 31, 2011 at 11:40 PM, Johannes Weiner <jweiner@redhat.com> wrote:
> On Wed, Aug 31, 2011 at 11:05:51PM -0700, Ying Han wrote:
>> On Tue, Aug 30, 2011 at 1:42 AM, Johannes Weiner <jweiner@redhat.com> wrote:
>> > You want to look at A and see whether its limit was responsible for
>> > reclaim scans in any children. IMO, that is asking the question
>> > backwards. Instead, there is a cgroup under reclaim and one wants to
>> > find out the cause for that. Not the other way round.
>> >
>> > In my original proposal I suggested differentiating reclaim caused by
>> > internal pressure (due to own limit) and reclaim caused by
>> > external/hierarchical pressure (due to limits from parents).
>> >
>> > If you want to find out why C is under reclaim, look at its reclaim
>> > statistics. If the _limit numbers are high, C's limit is the problem.
>> > If the _hierarchical numbers are high, the problem is B, A, or
>> > physical memory, so you check B for _limit and _hierarchical as well,
>> > then move on to A.
>> >
>> > Implementing this would be as easy as passing not only the memcg to
>> > scan (victim) to the reclaim code, but also the memcg /causing/ the
>> > reclaim (root_mem):
>> >
>> >   root_mem == victim -> account to victim as _limit
>> >   root_mem != victim -> account to victim as _hierarchical
>> >
>> > This would make things much simpler and more natural, both the code
>> > and the way of tracking down a problem, IMO.
>>
>> This is pretty much the stats I am currently using for debugging the
>> reclaim patches. For example:
>>
>> scanned_pages_by_system 0
>> scanned_pages_by_system_under_hierarchy 50989
>>
>> scanned_pages_by_limit 0
>> scanned_pages_by_limit_under_hierarchy 0
>>
>> "_system" is counted under global reclaim, and "_limit" is counted under
>> per-memcg reclaim.
>> "_under_hierarchy" is set if the memcg is not the one triggering pressure.
>
> I don't get this distinction between _system and _limit. How is it
> orthogonal to _limit vs. _hierarchy, i.e. internal vs. external?

Something like:

+enum mem_cgroup_scan_context {
+	SCAN_BY_SYSTEM,
+	SCAN_BY_SYSTEM_UNDER_HIERARCHY,
+	SCAN_BY_LIMIT,
+	SCAN_BY_LIMIT_UNDER_HIERARCHY,
+	NR_SCAN_CONTEXT,
+};

	if (global_reclaim(sc))
		context = SCAN_BY_SYSTEM;
	else
		context = SCAN_BY_LIMIT;

	if (target != mem)
		context++;

> If the system scans memcgs then no limit is at fault. It's just
> external pressure.
>
> For example, what is the distinction between scanned_pages_by_system
> and scanned_pages_by_system_under_hierarchy?

You are right about this; there is not much difference between the two,
since this is counting global reclaim and everyone is under_hierarchy
except the root cgroup. For the root cgroup, it is counted in "_system"
(internal).

> The reason for scanned_pages_by_system would be, per your definition,
> neither due to the limit (_by_system -> global reclaim) nor not due to
> the limit (!_under_hierarchy -> the memcg is the one triggering pressure)

The value "scanned_pages_by_system" only makes sense for the root
cgroup, where it could now be counted as "# of pages scanned in the
root lru under global reclaim".

--Ying
* Re: [patch] Revert "memcg: add memory.vmscan_stat" @ 2011-09-01 8:27 ` Johannes Weiner
From: Johannes Weiner @ 2011-09-01 8:27 UTC (permalink / raw)
To: Ying Han
Cc: KAMEZAWA Hiroyuki, Andrew Morton, Daisuke Nishimura, Balbir Singh, Michal Hocko, linux-mm, linux-kernel

On Thu, Sep 01, 2011 at 12:04:24AM -0700, Ying Han wrote:
> On Wed, Aug 31, 2011 at 11:40 PM, Johannes Weiner <jweiner@redhat.com> wrote:
> > On Wed, Aug 31, 2011 at 11:05:51PM -0700, Ying Han wrote:
> >> On Tue, Aug 30, 2011 at 1:42 AM, Johannes Weiner <jweiner@redhat.com> wrote:
> >> > You want to look at A and see whether its limit was responsible for
> >> > reclaim scans in any children. IMO, that is asking the question
> >> > backwards. Instead, there is a cgroup under reclaim and one wants to
> >> > find out the cause for that. Not the other way round.
> >> >
> >> > In my original proposal I suggested differentiating reclaim caused by
> >> > internal pressure (due to own limit) and reclaim caused by
> >> > external/hierarchical pressure (due to limits from parents).
> >> >
> >> > If you want to find out why C is under reclaim, look at its reclaim
> >> > statistics. If the _limit numbers are high, C's limit is the problem.
> >> > If the _hierarchical numbers are high, the problem is B, A, or
> >> > physical memory, so you check B for _limit and _hierarchical as well,
> >> > then move on to A.
> >> >
> >> > Implementing this would be as easy as passing not only the memcg to
> >> > scan (victim) to the reclaim code, but also the memcg /causing/ the
> >> > reclaim (root_mem):
> >> >
> >> >   root_mem == victim -> account to victim as _limit
> >> >   root_mem != victim -> account to victim as _hierarchical
> >> >
> >> > This would make things much simpler and more natural, both the code
> >> > and the way of tracking down a problem, IMO.
> >>
> >> This is pretty much the stats I am currently using for debugging the
> >> reclaim patches. For example:
> >>
> >> scanned_pages_by_system 0
> >> scanned_pages_by_system_under_hierarchy 50989
> >>
> >> scanned_pages_by_limit 0
> >> scanned_pages_by_limit_under_hierarchy 0
> >>
> >> "_system" is counted under global reclaim, and "_limit" is counted under
> >> per-memcg reclaim.
> >> "_under_hierarchy" is set if the memcg is not the one triggering pressure.
> >
> > I don't get this distinction between _system and _limit. How is it
> > orthogonal to _limit vs. _hierarchy, i.e. internal vs. external?
>
> Something like:
>
> +enum mem_cgroup_scan_context {
> +	SCAN_BY_SYSTEM,
> +	SCAN_BY_SYSTEM_UNDER_HIERARCHY,
> +	SCAN_BY_LIMIT,
> +	SCAN_BY_LIMIT_UNDER_HIERARCHY,
> +	NR_SCAN_CONTEXT,
> +};
>
> 	if (global_reclaim(sc))
> 		context = SCAN_BY_SYSTEM;
> 	else
> 		context = SCAN_BY_LIMIT;
>
> 	if (target != mem)
> 		context++;

I understand what you count, just not why. If we just had

	SCAN_LIMIT
	SCAN_HIERARCHY

wouldn't that be able to convey all that is necessary?

Global pressure is just hierarchical pressure; it comes from the
outermost 'container', which is the machine itself.

If you have just one memcg, SCAN_LIMIT shows reclaim pressure because
of the limit and SCAN_HIERARCHY shows global pressure. With a
hierarchical setup, you can find pressure either in SCAN_LIMIT or by
looking at SCAN_HIERARCHY and recursively checking the parent.

  root_mem_cgroup
  /
 A
 /
B

Where is the difference for B whether outside pressure is coming from
physical memory limitations or the limit in A? The problem is not in
B; you have to check the parents anyway.

Or put differently:

  root_mem_cgroup
  /
 A
 /
B
/
C

In C, you would account global pressure separately but would not make
a distinction between pressure from A's limit and pressure from B's
limit. What makes the physical memory limit so special that the
resulting reclaims must be distinguished from reclaims due to other
hierarchical limits?
end of thread, other threads:[~2011-09-01 8:28 UTC | newest]

Thread overview: 54+ messages (download: mbox.gz / follow: Atom feed)

2011-07-22  8:15 [PATCH v3] memcg: add memory.vmscan_stat  KAMEZAWA Hiroyuki
2011-08-08 12:43 ` Johannes Weiner
2011-08-08 23:33   ` KAMEZAWA Hiroyuki
2011-08-09  8:01     ` Johannes Weiner
2011-08-09  8:01       ` KAMEZAWA Hiroyuki
2011-08-13  1:04 ` Ying Han
2011-08-29 15:51 ` [patch] Revert "memcg: add memory.vmscan_stat"  Johannes Weiner
2011-08-30  1:12   ` KAMEZAWA Hiroyuki
2011-08-30  7:04     ` Johannes Weiner
2011-08-30  7:20       ` KAMEZAWA Hiroyuki
2011-08-30  7:35         ` KAMEZAWA Hiroyuki
2011-08-30  8:42         ` Johannes Weiner
2011-08-30  8:56           ` KAMEZAWA Hiroyuki
2011-08-30 10:17             ` Johannes Weiner
2011-08-30 10:34               ` KAMEZAWA Hiroyuki
2011-08-30 11:03                 ` Johannes Weiner
2011-08-30 23:38                   ` KAMEZAWA Hiroyuki
2011-08-30 10:38               ` KAMEZAWA Hiroyuki
2011-08-30 11:32                 ` Johannes Weiner
2011-08-30 23:29                   ` KAMEZAWA Hiroyuki
2011-08-31  6:23                     ` Johannes Weiner
2011-08-31  6:30                       ` KAMEZAWA Hiroyuki
2011-08-31  8:33                         ` Johannes Weiner
2011-09-01  6:05                           ` Ying Han
2011-09-01  6:40                             ` Johannes Weiner
2011-09-01  7:04                               ` Ying Han
2011-09-01  8:27                                 ` Johannes Weiner