* [PATCH] mm: do not drain pagevecs for mlock
@ 2011-12-30  6:36 Tao Ma
  2011-12-30  8:11 ` KOSAKI Motohiro
  0 siblings, 1 reply; 28+ messages in thread
From: Tao Ma @ 2011-12-30  6:36 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, David Rientjes, Minchan Kim, KOSAKI Motohiro,
	Mel Gorman, Johannes Weiner, Andrew Morton

In our test of mlock, we have found some severe performance regression
in it. Further investigation shows that mlock is blocked heavily by
lru_add_drain_all, which calls schedule_on_each_cpu and flushes the
work queue; this is very slow if we have several cpus.

So we have tried 2 ways to solve it:

1. Add a per-cpu counter for all the pagevecs so that we don't schedule
   and flush the lru_drain work if the cpu doesn't have any pagevecs (I
   have finished the code already).
2. Remove the lru_add_drain_all.

The first one has some problems since in our production system all the
cpus are busy, so I guess there is very little chance for a cpu to have
0 pagevecs unless you run several consecutive mlocks.

From the commit log which added this function (8891d6da), it seems that
we don't have to call it. So the 2nd one seems to be both easy and
workable, hence this patch.

Thanks,
Tao

From 8cdf7f7ed236367e85151db65ae06f781aca7d77 Mon Sep 17 00:00:00 2001
From: Tao Ma <boyu.mt@taobao.com>
Date: Fri, 30 Dec 2011 14:20:08 +0800
Subject: [PATCH] mm: do not drain pagevecs for mlock

In 8891d6da, lru_add_drain_all was added to mlock to flush all the
per-cpu pagevecs. It makes this system call run much slower than its
predecessor (around 20 times slower on a 16-core Xeon E5620), and the
more cores we have, the bigger the performance penalty, because of the
nasty call to schedule_on_each_cpu.

From the commit log of 8891d6da we can see that "it isn't must. but it
reduce the failure of moving to unevictable list. its failure can
rescue in vmscan later."
Christoph Lameter removed the call in mlockall(MCL_FUTURE), so this
patch removes all the calls from mlock/mlockall.

Without this patch:
time ./test_mlock -c 100000
real	0m20.566s
user	0m0.074s
sys	0m12.759s

With this patch:
time ./test_mlock -c 100000
real	0m1.675s
user	0m0.049s
sys	0m1.622s

Cc: David Rientjes <rientjes@google.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
---
 mm/mlock.c |    5 -----
 1 files changed, 0 insertions(+), 5 deletions(-)

diff --git a/mm/mlock.c b/mm/mlock.c
index 4f4f53b..bb5fc42 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -487,8 +487,6 @@ SYSCALL_DEFINE2(mlock, unsigned long, start, size_t, len)
 	if (!can_do_mlock())
 		return -EPERM;
 
-	lru_add_drain_all();	/* flush pagevec */
-
 	down_write(&current->mm->mmap_sem);
 	len = PAGE_ALIGN(len + (start & ~PAGE_MASK));
 	start &= PAGE_MASK;
@@ -557,9 +555,6 @@ SYSCALL_DEFINE1(mlockall, int, flags)
 	if (!can_do_mlock())
 		goto out;
 
-	if (flags & MCL_CURRENT)
-		lru_add_drain_all();	/* flush pagevec */
-
 	down_write(&current->mm->mmap_sem);
 
 	lock_limit = rlimit(RLIMIT_MEMLOCK);
-- 
1.7.4.1

^ permalink raw reply related	[flat|nested] 28+ messages in thread
* Re: [PATCH] mm: do not drain pagevecs for mlock
  2011-12-30  6:36 [PATCH] mm: do not drain pagevecs for mlock Tao Ma
@ 2011-12-30  8:11 ` KOSAKI Motohiro
  2011-12-30  8:48   ` Tao Ma
  0 siblings, 1 reply; 28+ messages in thread
From: KOSAKI Motohiro @ 2011-12-30  8:11 UTC (permalink / raw)
  To: Tao Ma
  Cc: linux-mm, linux-kernel, David Rientjes, Minchan Kim, Mel Gorman,
	Johannes Weiner, Andrew Morton

2011/12/30 Tao Ma <tm@tao.ma>:
> In our test of mlock, we have found some severe performance regression
> in it. Further investigation shows that mlock is blocked heavily by
> lru_add_drain_all, which calls schedule_on_each_cpu and flushes the
> work queue; this is very slow if we have several cpus.
>
> So we have tried 2 ways to solve it:
> 1. Add a per-cpu counter for all the pagevecs so that we don't schedule
>    and flush the lru_drain work if the cpu doesn't have any pagevecs (I
>    have finished the code already).
> 2. Remove the lru_add_drain_all.
>
> The first one has some problems since in our production system all the
> cpus are busy, so I guess there is very little chance for a cpu to have
> 0 pagevecs unless you run several consecutive mlocks.
>
> From the commit log which added this function (8891d6da), it seems that
> we don't have to call it. So the 2nd one seems to be both easy and
> workable, hence this patch.

Could you please show us your system environment and benchmark programs?
Usually lru_drain_* is much faster than the mlock() body, because the
latter does plenty of memset(page).
* Re: [PATCH] mm: do not drain pagevecs for mlock
  2011-12-30  8:11 ` KOSAKI Motohiro
@ 2011-12-30  8:48   ` Tao Ma
  2011-12-30  9:31     ` KOSAKI Motohiro
  0 siblings, 1 reply; 28+ messages in thread
From: Tao Ma @ 2011-12-30  8:48 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: linux-mm, linux-kernel, David Rientjes, Minchan Kim, Mel Gorman,
	Johannes Weiner, Andrew Morton

[-- Attachment #1: Type: text/plain, Size: 1336 bytes --]

On 12/30/2011 04:11 PM, KOSAKI Motohiro wrote:
> 2011/12/30 Tao Ma <tm@tao.ma>:
>> [...]
>
> Could you please show us your system environment and benchmark programs?
> Usually lru_drain_* is much faster than the mlock() body, because the
> latter does plenty of memset(page).
The system environment is: 16-core Xeon E5620, 24G memory.

I have attached the program. It is very simple and just uses
mlock/munlock.
Thanks,
Tao

[-- Attachment #2: test_mlock.c --]
[-- Type: text/x-csrc, Size: 981 bytes --]

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <errno.h>
#include <time.h>
#include <sys/time.h>
#include <sys/mman.h>

#define MM_SZ1	24
#define MM_SZ2	56
#define MM_SZ3	4168

void mlock_test()
{
	char ptr1[MM_SZ1];
	char ptr2[MM_SZ2];
	char ptr3[MM_SZ3];

	if (0 != mlock(ptr1, MM_SZ1))
		perror("mlock MM_SZ1\n");
	if (0 != mlock(ptr2, MM_SZ2))
		perror("mlock MM_SZ2\n");
	if (0 != mlock(ptr3, MM_SZ3))
		perror("mlock MM_SZ3\n");

	if (0 != munlock(ptr1, MM_SZ1))
		perror("munlock MM_SZ1\n");
	if (0 != munlock(ptr2, MM_SZ2))
		perror("munlock MM_SZ2\n");
	if (0 != munlock(ptr3, MM_SZ3))
		perror("munlock MM_SZ3\n");
}

int main(int argc, char *argv[])
{
	int opt;
	int i, cnt = 0;	/* default to 0 iterations if -c is not given */

	while ((opt = getopt(argc, argv, "c:")) != -1) {
		switch (opt) {
		case 'c':
			cnt = atoi(optarg);
			break;
		default:
			printf("Usage: %s [-c count] arg...\n", argv[0]);
			exit(EXIT_FAILURE);
		}
	}

	for (i = 0; i < cnt; i++)
		mlock_test();

	return 0;
}
* Re: [PATCH] mm: do not drain pagevecs for mlock
  2011-12-30  8:48   ` Tao Ma
@ 2011-12-30  9:31     ` KOSAKI Motohiro
  2011-12-30  9:45       ` Tao Ma
  0 siblings, 1 reply; 28+ messages in thread
From: KOSAKI Motohiro @ 2011-12-30  9:31 UTC (permalink / raw)
  To: Tao Ma
  Cc: linux-mm, linux-kernel, David Rientjes, Minchan Kim, Mel Gorman,
	Johannes Weiner, Andrew Morton

2011/12/30 Tao Ma <tm@tao.ma>:
> On 12/30/2011 04:11 PM, KOSAKI Motohiro wrote:
>> 2011/12/30 Tao Ma <tm@tao.ma>:
>>> [...]
>>
>> Could you please show us your system environment and benchmark programs?
>> Usually lru_drain_* is much faster than the mlock() body, because the
>> latter does plenty of memset(page).
> The system environment is: 16-core Xeon E5620, 24G memory.
>
> I have attached the program. It is very simple and just uses
> mlock/munlock.

Because your test program is too artificial: 20sec/100000times =
200usec. And your program repeats mlock and munlock on the exact same
address, so, yes, if lru_add_drain_all() is removed it becomes nearly a
no-op. But it's a worthless comparison; no practical program uses mlock
in such a strange way.

But 200usec is much more than I measured before. I'll dig into it a
bit more.
* Re: [PATCH] mm: do not drain pagevecs for mlock
  2011-12-30  9:31     ` KOSAKI Motohiro
@ 2011-12-30  9:45       ` Tao Ma
  2011-12-30 10:07         ` KOSAKI Motohiro
  2011-12-30 10:14         ` KOSAKI Motohiro
  0 siblings, 2 replies; 28+ messages in thread
From: Tao Ma @ 2011-12-30  9:45 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: linux-mm, linux-kernel, David Rientjes, Minchan Kim, Mel Gorman,
	Johannes Weiner, Andrew Morton

On 12/30/2011 05:31 PM, KOSAKI Motohiro wrote:
> 2011/12/30 Tao Ma <tm@tao.ma>:
>> [...]
>> The system environment is: 16-core Xeon E5620, 24G memory.
>>
>> I have attached the program. It is very simple and just uses
>> mlock/munlock.
>
> Because your test program is too artificial: 20sec/100000times =
> 200usec. And your program repeats mlock and munlock on the exact same
> address, so, yes, if lru_add_drain_all() is removed it becomes nearly a
> no-op. But it's a worthless comparison; no practical program uses mlock
> in such a strange way.
yes, I should say it is artificial. But mlock did cause the problem in
our production system, and perf shows that mlock uses much more system
time than anything else. That's the reason we created this program: to
test whether mlock really sucks. And we compared the result with
rhel5 (2.6.18), which runs much, much faster.

And from the commit log you described, we can remove lru_add_drain_all
safely here, so why add it? At least removing it makes mlock much
faster compared to the vanilla kernel.

> But 200usec is much more than I measured before. I'll dig into it a
> bit more.
Thanks for the help.

Tao
* Re: [PATCH] mm: do not drain pagevecs for mlock
  2011-12-30  9:45       ` Tao Ma
@ 2011-12-30 10:07         ` KOSAKI Motohiro
  2012-01-01  7:30           ` [PATCH 1/2] mm,mlock: drain pagevecs asynchronously kosaki.motohiro
                             ` (3 more replies)
  2011-12-30 10:14         ` KOSAKI Motohiro
  1 sibling, 4 replies; 28+ messages in thread
From: KOSAKI Motohiro @ 2011-12-30 10:07 UTC (permalink / raw)
  To: Tao Ma
  Cc: linux-mm, linux-kernel, David Rientjes, Minchan Kim, Mel Gorman,
	Johannes Weiner, Andrew Morton

>> Because your test program is too artificial: 20sec/100000times =
>> 200usec. And your program repeats mlock and munlock on the exact same
>> address, so, yes, if lru_add_drain_all() is removed it becomes nearly a
>> no-op. But it's a worthless comparison; no practical program uses mlock
>> in such a strange way.
> yes, I should say it is artificial. But mlock did cause the problem in
> our production system, and perf shows that mlock uses much more system
> time than anything else. That's the reason we created this program: to
> test whether mlock really sucks. And we compared the result with
> rhel5 (2.6.18), which runs much, much faster.
>
> And from the commit log you described, we can remove lru_add_drain_all
> safely here, so why add it? At least removing it makes mlock much
> faster compared to the vanilla kernel.

If we remove it, we lose a way to test mlock: the "Mlocked" field of
/proc/meminfo very easily shows an inaccurate number.

So, if the 200usec is unavoidable, I'll ack you. But I'm not convinced
yet.
* [PATCH 1/2] mm,mlock: drain pagevecs asynchronously
  2011-12-30 10:07         ` KOSAKI Motohiro
@ 2012-01-01  7:30           ` kosaki.motohiro
  2012-01-04  1:17             ` Minchan Kim
                               ` (2 more replies)
  2012-01-01  7:30           ` [PATCH 2/2] sysvshm: SHM_LOCK use lru_add_drain_all_async() kosaki.motohiro
                             ` (2 subsequent siblings)
  3 siblings, 3 replies; 28+ messages in thread
From: kosaki.motohiro @ 2012-01-01  7:30 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: KOSAKI Motohiro, David Rientjes, Minchan Kim, Mel Gorman,
	Johannes Weiner, Andrew Morton

From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

Tao Ma reported that current mlock is much slower than the old 2.6.18
kernel because lru_add_drain_all() spends much time. There are two
problems:
1) lru_add_drain_all() broadcasts a worker thread to all cpus
   unconditionally, so the performance penalty increases in proportion
   to the number of cpus.
2) lru_add_drain_all() waits for the workers to finish unnecessarily,
   which makes the penalty even bigger.

This patch makes lru_add_drain_all_async() and changes mlock/mlockall
to use it.

Technical side note:
- has_pages_lru_pvecs() checks pagevecs locklessly. Of course, it's
  racy. But that doesn't matter, because the asynchronous worker itself
  is also racy; no lock can close the race.
- Now we drain the pagevecs at the end of mlock instead of at the
  beginning, because the page drain function (____pagevec_lru_add_fn)
  is PG_mlocked aware now. Then it's safe, and it closes more of the
  race.

Without the patch:
% time ./test_mlock -c 100000

real	1m13.608s
user	0m0.204s
sys	0m40.115s

i.e. 200usec per mlock

With the patch:
% time ./test_mlock -c 100000
real	0m3.939s
user	0m0.060s
sys	0m3.868s

i.e. 13usec per mlock

test_mlock.c
==========================================
[same test program as attached upthread]
===========================================

Reported-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
 include/linux/swap.h |    1 +
 mm/mlock.c           |    7 +----
 mm/swap.c            |   64 ++++++++++++++++++++++++++++++++++++++++++++++---
 3 files changed, 63 insertions(+), 9 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 1e22e12..11ad301 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -223,6 +223,7 @@ extern void activate_page(struct page *);
 extern void mark_page_accessed(struct page *);
 extern void lru_add_drain(void);
 extern int lru_add_drain_all(void);
+extern void lru_add_drain_all_async(void);
 extern void rotate_reclaimable_page(struct page *page);
 extern void deactivate_page(struct page *page);
 extern void swap_setup(void);
diff --git a/mm/mlock.c b/mm/mlock.c
index 4f4f53b..08f5b6b 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -487,8 +487,6 @@ SYSCALL_DEFINE2(mlock, unsigned long, start, size_t, len)
 	if (!can_do_mlock())
 		return -EPERM;
 
-	lru_add_drain_all();	/* flush pagevec */
-
 	down_write(&current->mm->mmap_sem);
 	len = PAGE_ALIGN(len + (start & ~PAGE_MASK));
 	start &= PAGE_MASK;
@@ -505,6 +503,7 @@ SYSCALL_DEFINE2(mlock, unsigned long, start, size_t, len)
 	up_write(&current->mm->mmap_sem);
 	if (!error)
 		error = do_mlock_pages(start, len, 0);
+	lru_add_drain_all_async();
 	return error;
 }
 
@@ -557,9 +556,6 @@ SYSCALL_DEFINE1(mlockall, int, flags)
 	if (!can_do_mlock())
 		goto out;
 
-	if (flags & MCL_CURRENT)
-		lru_add_drain_all();	/* flush pagevec */
-
 	down_write(&current->mm->mmap_sem);
 
 	lock_limit = rlimit(RLIMIT_MEMLOCK);
@@ -573,6 +569,7 @@ SYSCALL_DEFINE1(mlockall, int, flags)
 	if (!ret && (flags & MCL_CURRENT)) {
 		/* Ignore errors */
 		do_mlock_pages(0, TASK_SIZE, 1);
+		lru_add_drain_all_async();
 	}
 out:
 	return ret;
diff --git a/mm/swap.c b/mm/swap.c
index a91caf7..2690f04 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -569,6 +569,49 @@ int lru_add_drain_all(void)
 	return schedule_on_each_cpu(lru_add_drain_per_cpu);
 }
 
+static DEFINE_PER_CPU(struct work_struct, lru_drain_work);
+
+static int __init lru_drain_work_init(void)
+{
+	struct work_struct *work;
+	int cpu;
+
+	for_each_possible_cpu(cpu) {
+		work = &per_cpu(lru_drain_work, cpu);
+		INIT_WORK(work, &lru_add_drain_per_cpu);
+	}
+
+	return 0;
+}
+core_initcall(lru_drain_work_init);
+
+static bool has_pages_lru_pvecs(int cpu)
+{
+	struct pagevec *pvecs = per_cpu(lru_add_pvecs, cpu);
+	struct pagevec *pvec;
+	int lru;
+
+	for_each_lru(lru) {
+		pvec = &pvecs[lru - LRU_BASE];
+		if (pagevec_count(pvec))
+			return true;
+	}
+
+	return false;
+}
+
+void lru_add_drain_all_async(void)
+{
+	int cpu;
+
+	for_each_online_cpu(cpu) {
+		struct work_struct *work = &per_cpu(lru_drain_work, cpu);
+
+		if (has_pages_lru_pvecs(cpu))
+			schedule_work_on(cpu, work);
+	}
+}
+
 /*
  * Batched page_cache_release(). Decrement the reference count on all the
  * passed pages. If it fell to zero then remove the page from the LRU and
@@ -704,10 +747,23 @@ static void ____pagevec_lru_add_fn(struct page *page, void *arg)
 	VM_BUG_ON(PageLRU(page));
 
 	SetPageLRU(page);
-	if (active)
-		SetPageActive(page);
-	update_page_reclaim_stat(zone, page, file, active);
-	add_page_to_lru_list(zone, page, lru);
+redo:
+	if (page_evictable(page, NULL)) {
+		if (active)
+			SetPageActive(page);
+		update_page_reclaim_stat(zone, page, file, active);
+		add_page_to_lru_list(zone, page, lru);
+	} else {
+		SetPageUnevictable(page);
+		add_page_to_lru_list(zone, page, LRU_UNEVICTABLE);
+		smp_mb();
+
+		if (page_evictable(page, NULL)) {
+			del_page_from_lru_list(zone, page, LRU_UNEVICTABLE);
+			ClearPageUnevictable(page);
+			goto redo;
+		}
+	}
 }
 
 /*
-- 
1.7.1
* Re: [PATCH 1/2] mm,mlock: drain pagevecs asynchronously
  2012-01-01  7:30           ` [PATCH 1/2] mm,mlock: drain pagevecs asynchronously kosaki.motohiro
@ 2012-01-04  1:17             ` Minchan Kim
  2012-01-04  2:38               ` KOSAKI Motohiro
  0 siblings, 1 reply; 28+ messages in thread
From: Minchan Kim @ 2012-01-04  1:17 UTC (permalink / raw)
  To: kosaki.motohiro
  Cc: linux-mm, linux-kernel, KOSAKI Motohiro, David Rientjes,
	Minchan Kim, Mel Gorman, Johannes Weiner, Andrew Morton

Hi KOSAKI,

On Sun, Jan 01, 2012 at 02:30:24AM -0500, kosaki.motohiro@gmail.com wrote:
> From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
>
> Tao Ma reported that current mlock is much slower than the old 2.6.18
> kernel because lru_add_drain_all() spends much time.
>
> [rest of commit message, benchmark results and test program trimmed]
>
> diff --git a/mm/mlock.c b/mm/mlock.c
> index 4f4f53b..08f5b6b 100644
> --- a/mm/mlock.c
> +++ b/mm/mlock.c
> @@ -505,6 +503,7 @@ SYSCALL_DEFINE2(mlock, unsigned long, start, size_t, len)
>  	up_write(&current->mm->mmap_sem);
>  	if (!error)
>  		error = do_mlock_pages(start, len, 0);
> +	lru_add_drain_all_async();

"flush pagevec" was not a good comment. While we are at it, we should
add a kinder comment here explaining why we drain at this point.

>  	return error;
>  }
>
> [mlockall hunk and lru_add_drain_all_async() implementation trimmed]
>
> @@ -704,10 +747,23 @@ static void ____pagevec_lru_add_fn(struct page *page, void *arg)
>  	VM_BUG_ON(PageLRU(page));
>  
>  	SetPageLRU(page);
> -	if (active)
> -		SetPageActive(page);
> -	update_page_reclaim_stat(zone, page, file, active);
> -	add_page_to_lru_list(zone, page, lru);
> +redo:
> +	if (page_evictable(page, NULL)) {
> +		if (active)
> +			SetPageActive(page);
> +		update_page_reclaim_stat(zone, page, file, active);
> +		add_page_to_lru_list(zone, page, lru);
> +	} else {
> +		SetPageUnevictable(page);
> +		add_page_to_lru_list(zone, page, LRU_UNEVICTABLE);
> +		smp_mb();

Why do we need a barrier here? Please comment it.

> +
> +		if (page_evictable(page, NULL)) {
> +			del_page_from_lru_list(zone, page, LRU_UNEVICTABLE);
> +			ClearPageUnevictable(page);
> +			goto redo;
> +		}
> +	}

I am not sure it's a good idea. mlock is a very rare event, but
____pagevec_lru_add_fn is called frequently. We are adding more
overhead to ____pagevec_lru_add_fn. Is it worth it?

>  }

-- 
Kind regards,
Minchan Kim
* Re: [PATCH 1/2] mm,mlock: drain pagevecs asynchronously
  2012-01-04  1:17             ` Minchan Kim
@ 2012-01-04  2:38               ` KOSAKI Motohiro
  2012-01-10  8:53                 ` Tao Ma
  0 siblings, 1 reply; 28+ messages in thread
From: KOSAKI Motohiro @ 2012-01-04  2:38 UTC (permalink / raw)
  To: Minchan Kim
  Cc: linux-mm, linux-kernel, KOSAKI Motohiro, David Rientjes,
	Minchan Kim, Mel Gorman, Johannes Weiner, Andrew Morton

>> +	} else {
>> +		SetPageUnevictable(page);
>> +		add_page_to_lru_list(zone, page, LRU_UNEVICTABLE);
>> +		smp_mb();
>
> Why do we need a barrier here? Please comment it.

Would cut-and-pasting a comment from putback_lru_page() be a good
idea? :)

+		/*
+		 * When racing with an mlock clearing (page is
+		 * unlocked), make sure that if the other thread does
+		 * not observe our setting of PG_lru and fails
+		 * isolation, we see PG_mlocked cleared below and move
+		 * the page back to the evictable list.
+		 *
+		 * The other side is TestClearPageMlocked().
+		 */
+		smp_mb();

>> +		if (page_evictable(page, NULL)) {
>> +			del_page_from_lru_list(zone, page, LRU_UNEVICTABLE);
>> +			ClearPageUnevictable(page);
>> +			goto redo;
>> +		}
>> +	}
>
> I am not sure it's a good idea. mlock is a very rare event, but
> ____pagevec_lru_add_fn is called frequently. We are adding more
> overhead to ____pagevec_lru_add_fn. Is it worth it?

dunno.

Personally, I think Tao's case is too artificial, and I haven't
observed any real-world application doing such crazy mlock/munlock
repetition. But he said he has such an application.

If I remember correctly, LTP or some other test suite depends on the
current meminfo synching behavior, so I'm afraid simply removing the
drain would bring us new annoying bug reports.
* Re: [PATCH 1/2] mm,mlock: drain pagevecs asynchronously 2012-01-04 2:38 ` KOSAKI Motohiro @ 2012-01-10 8:53 ` Tao Ma 0 siblings, 0 replies; 28+ messages in thread From: Tao Ma @ 2012-01-10 8:53 UTC (permalink / raw) To: KOSAKI Motohiro Cc: Minchan Kim, linux-mm, linux-kernel, KOSAKI Motohiro, David Rientjes, Minchan Kim, Mel Gorman, Johannes Weiner, Andrew Morton Hi KOSAKI, On 01/04/2012 10:38 AM, KOSAKI Motohiro wrote: > >>> @@ -704,10 +747,23 @@ static void ____pagevec_lru_add_fn(struct page >>> *page, void *arg) >>> VM_BUG_ON(PageLRU(page)); >>> >>> SetPageLRU(page); >>> - if (active) >>> - SetPageActive(page); >>> - update_page_reclaim_stat(zone, page, file, active); >>> - add_page_to_lru_list(zone, page, lru); >>> + redo: >>> + if (page_evictable(page, NULL)) { >>> + if (active) >>> + SetPageActive(page); >>> + update_page_reclaim_stat(zone, page, file, active); >>> + add_page_to_lru_list(zone, page, lru); >>> + } else { >>> + SetPageUnevictable(page); >>> + add_page_to_lru_list(zone, page, LRU_UNEVICTABLE); >>> + smp_mb(); >> >> Why do we need barrier in here? Please comment it. > > To cut-n-paste a comment from putback_lru_page() is good idea? :) > > + /* > + * When racing with an mlock clearing (page is > + * unlocked), make sure that if the other thread does > + * not observe our setting of PG_lru and fails > + * isolation, we see PG_mlocked cleared below and move > + * the page back to the evictable list. > + * > + * The other side is TestClearPageMlocked(). > + */ > + smp_mb(); > > > >>> + if (page_evictable(page, NULL)) { >>> + del_page_from_lru_list(zone, page, LRU_UNEVICTABLE); >>> + ClearPageUnevictable(page); >>> + goto redo; >>> + } >>> + } >> >> I am not sure it's a good idea. >> mlock is very rare event but ____pagevec_lru_add_fn is called frequently. >> We are adding more overhead in ____pagevec_lru_add_fn. >> Is it valuable? > > dunno. 
>
> Personally, I think Tao's case is too artificial and I haven't observed
> any real-world application doing such crazy mlock/munlock repetition. But
> he said he has such an application.

ok, I will talk more about our application here. It is the backend of a PHP service, and for every user request we have to call libmcrypt (http://sourceforge.net/projects/mcrypt/) several times to encrypt some information, and libmcrypt uses mlock/munlock. Since a server can finish many requests in one second, the total mlock/munlock count sums up to around 2000 per second, and that really matters for us.

>
> If I remember correctly, LTP or some other test suite depends on the
> current meminfo syncing behavior. So I'm afraid simply removing it would
> bring us new annoying bug reports.

So this is the only side effect of removing the lru_add_drain_all from mlock/mlockall, right? Are there any other known issues? I have read Andrew's comment, and if we have decided to remove all this lru_* stuff, it seems we have a long way to go before this issue can be completely resolved. So I will remove it from our production kernel first and wait for your final cleanup. Great thanks for your time and kind help.

Thanks
Tao ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 1/2] mm,mlock: drain pagevecs asynchronously 2012-01-01 7:30 ` [PATCH 1/2] mm,mlock: drain pagevecs asynchronously kosaki.motohiro 2012-01-04 1:17 ` Minchan Kim @ 2012-01-04 2:56 ` Hugh Dickins 2012-01-04 22:05 ` Andrew Morton 2 siblings, 0 replies; 28+ messages in thread From: Hugh Dickins @ 2012-01-04 2:56 UTC (permalink / raw) To: kosaki.motohiro Cc: linux-mm, linux-kernel, KOSAKI Motohiro, David Rientjes, Minchan Kim, Mel Gorman, Johannes Weiner, Andrew Morton, Michel Lespinasse On Sun, 1 Jan 2012, kosaki.motohiro@gmail.com wrote: > From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> > > Tao Ma reported current mlock is much slower than old 2.6.18 kernel. Because > lru_add_drain_all() spent much time. The problem are two. 1) lru_add_drain_all() > broadcast a worker thread to all cpus unconditionally. then, the performance > penalty is increased in proportion to number of cpus. 2) lru_add_drain_all() > wait the worker finished unnecessary. It makes bigger penalty. > > This patch makes lru_add_drain_all_async() and changes mlock/mlockall use it. > > Technical side note: > - has_pages_lru_pvecs() checks pagevecs locklessly. Of course, it's racy. > But it's no matter because asynchronous worker itself is also racy. > any lock can't close a race. > - Now, we drain pagevec at last of mlock instead of beginning. because > a page drain function (____pagevec_lru_add_fn) is PG_mlocked aware now. > Then it's safe and it close more race. > > Without the patch: > % time ./test_mlock -c 100000 > > real 1m13.608s > user 0m0.204s > sys 0m40.115s > > i.e. 200usec per mlock > > With the patch: > % time ./test_mlock -c 100000 > real 0m3.939s > user 0m0.060s > sys 0m3.868s > > i.e. 13usec per mlock That's very nice; but it is an artificial test of the codepath you have clearly speeded up, without any results for the codepaths you have probably slowed down - I see Minchan is worried about that too. 
> > test_mlock.c > ========================================== > #include <stdio.h> > #include <stdlib.h> > #include <unistd.h> > #include <errno.h> > #include <time.h> > #include <sys/time.h> > #include <sys/mman.h> > > #define MM_SZ1 24 > #define MM_SZ2 56 > #define MM_SZ3 4168 > > void mlock_test() > { > char ptr1[MM_SZ1]; > char ptr2[MM_SZ2]; > char ptr3[MM_SZ3]; > > if(0 != mlock(ptr1, MM_SZ1) ) > perror("mlock MM_SZ1\n"); > if(0 != mlock(ptr2, MM_SZ2) ) > perror("mlock MM_SZ2\n"); > if(0 != mlock(ptr3, MM_SZ3) ) > perror("mlock MM_SZ3\n"); > > if(0 != munlock(ptr1, MM_SZ1) ) > perror("munlock MM_SZ1\n"); > if(0 != munlock(ptr2, MM_SZ2) ) > perror("munlock MM_SZ2\n"); > if(0 != munlock(ptr3, MM_SZ3) ) > perror("munlock MM_SZ3\n"); > } > > int main(int argc, char *argv[]) > { > int ret, opt; > int i,cnt; > > while((opt = getopt(argc, argv, "c:")) != -1 ) > { > switch(opt){ > case 'c': > cnt = atoi(optarg); > break; > default: > printf("Usage: %s [-c count] arg...\n", argv[0]); > exit(EXIT_FAILURE); > } > } > > for(i = 0; i < cnt; i++) > mlock_test(); > > return 0; > } > =========================================== > > Reported-by: Tao Ma <boyu.mt@taobao.com> > Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> > Cc: David Rientjes <rientjes@google.com> > Cc: Minchan Kim <minchan.kim@gmail.com> > Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> > Cc: Mel Gorman <mel@csn.ul.ie> > Cc: Johannes Weiner <jweiner@redhat.com> > Cc: Andrew Morton <akpm@linux-foundation.org> > --- > include/linux/swap.h | 1 + > mm/mlock.c | 7 +---- > mm/swap.c | 64 ++++++++++++++++++++++++++++++++++++++++++++++--- > 3 files changed, 63 insertions(+), 9 deletions(-) > > diff --git a/include/linux/swap.h b/include/linux/swap.h > index 1e22e12..11ad301 100644 > --- a/include/linux/swap.h > +++ b/include/linux/swap.h > @@ -223,6 +223,7 @@ extern void activate_page(struct page *); > extern void mark_page_accessed(struct page *); > extern void lru_add_drain(void); > extern int 
lru_add_drain_all(void); > +extern void lru_add_drain_all_async(void); > extern void rotate_reclaimable_page(struct page *page); > extern void deactivate_page(struct page *page); > extern void swap_setup(void); > diff --git a/mm/mlock.c b/mm/mlock.c > index 4f4f53b..08f5b6b 100644 > --- a/mm/mlock.c > +++ b/mm/mlock.c > @@ -487,8 +487,6 @@ SYSCALL_DEFINE2(mlock, unsigned long, start, size_t, len) > if (!can_do_mlock()) > return -EPERM; > > - lru_add_drain_all(); /* flush pagevec */ > - > down_write(¤t->mm->mmap_sem); > len = PAGE_ALIGN(len + (start & ~PAGE_MASK)); > start &= PAGE_MASK; > @@ -505,6 +503,7 @@ SYSCALL_DEFINE2(mlock, unsigned long, start, size_t, len) > up_write(¤t->mm->mmap_sem); > if (!error) > error = do_mlock_pages(start, len, 0); > + lru_add_drain_all_async(); > return error; > } > > @@ -557,9 +556,6 @@ SYSCALL_DEFINE1(mlockall, int, flags) > if (!can_do_mlock()) > goto out; > > - if (flags & MCL_CURRENT) > - lru_add_drain_all(); /* flush pagevec */ > - > down_write(¤t->mm->mmap_sem); > > lock_limit = rlimit(RLIMIT_MEMLOCK); > @@ -573,6 +569,7 @@ SYSCALL_DEFINE1(mlockall, int, flags) > if (!ret && (flags & MCL_CURRENT)) { > /* Ignore errors */ > do_mlock_pages(0, TASK_SIZE, 1); > + lru_add_drain_all_async(); > } > out: > return ret; > diff --git a/mm/swap.c b/mm/swap.c > index a91caf7..2690f04 100644 > --- a/mm/swap.c > +++ b/mm/swap.c > @@ -569,6 +569,49 @@ int lru_add_drain_all(void) > return schedule_on_each_cpu(lru_add_drain_per_cpu); > } > > +static DEFINE_PER_CPU(struct work_struct, lru_drain_work); > + > +static int __init lru_drain_work_init(void) > +{ > + struct work_struct *work; > + int cpu; > + > + for_each_possible_cpu(cpu) { > + work = &per_cpu(lru_drain_work, cpu); > + INIT_WORK(work, &lru_add_drain_per_cpu); > + } > + > + return 0; > +} > +core_initcall(lru_drain_work_init); > + > +static bool has_pages_lru_pvecs(int cpu) I couldn't make sense of that name! 
Though, to be fair, I never made sense of "lru_add_drain" either: I always imagine a plumber adding a drain. pages_are_pending_lru_add(cpu)? > +{ > + struct pagevec *pvecs = per_cpu(lru_add_pvecs, cpu); > + struct pagevec *pvec; > + int lru; > + > + for_each_lru(lru) { > + pvec = &pvecs[lru - LRU_BASE]; Hmm, yes, I see other such loops say "- LRU_BASE" too. Good thing LRU_BASE is 0! > + if (pagevec_count(pvec)) > + return true; > + } > + > + return false; > +} > + > +void lru_add_drain_all_async(void) > +{ > + int cpu; > + > + for_each_online_cpu(cpu) { > + struct work_struct *work = &per_cpu(lru_drain_work, cpu); > + > + if (has_pages_lru_pvecs(cpu)) > + schedule_work_on(cpu, work); > + } > +} You'll be peeking at a lot of remote cpus here, rather than sending them all an IPI. My guess is that it is more efficient your way, but that's not as obvious as I'd like. I haven't added Gilad Ben-Yossef to the Cc list, but be aware that he's particularly interested in reducing IPIs, and has some patches on lkml. I have added Michel Lespinasse to the Cc list, he's good to Cc on mlock matters. He thought that if the pages_are_pending check is a win, then you should do it in lru_add_drain_all() too. But more, he wondered if we couldn't do a better job without all this, by doing just lru_add_drain() first, and then doing the extra work (of prodding other cpus) only if we come across a page which needs it. > + > /* > * Batched page_cache_release(). Decrement the reference count on all the > * passed pages. 
If it fell to zero then remove the page from the LRU and > @@ -704,10 +747,23 @@ static void ____pagevec_lru_add_fn(struct page *page, void *arg) > VM_BUG_ON(PageLRU(page)); > > SetPageLRU(page); > - if (active) > - SetPageActive(page); > - update_page_reclaim_stat(zone, page, file, active); > - add_page_to_lru_list(zone, page, lru); > + redo: > + if (page_evictable(page, NULL)) { > + if (active) > + SetPageActive(page); > + update_page_reclaim_stat(zone, page, file, active); > + add_page_to_lru_list(zone, page, lru); > + } else { > + SetPageUnevictable(page); > + add_page_to_lru_list(zone, page, LRU_UNEVICTABLE); > + smp_mb(); > + > + if (page_evictable(page, NULL)) { > + del_page_from_lru_list(zone, page, LRU_UNEVICTABLE); > + ClearPageUnevictable(page); > + goto redo; > + } > + } > } Right, this is where the overhead comes, in everybody's pagevec_lru_add_fn. Now, I do like that you're resolving the un/evictable issue here, and if the overhead really doesn't hurt, then this is a nice way to go. But I think you should go further to reduce your overhead. For one thing, certainly take that silly almost-always-NULL vma arg out of page_unevictable, and test vma directly in the one or two places that want it; maybe make page_unevictable(page) an mm_inline function. And update those various places (many or all in vmscan.c) which are asking if page_unevictable(), to avoid using pagevec for unevictable page because they know it doesn't work: you are now making it work. Hah, I see my own putback_lru(or inactive)_pages patch is open to the same criticism: if I'm now using a page_list there, it could handle the unevictable pages directly instead of having to avoid them. I think. 
(I've never noticed before, it worries me now that page_unevictable() is referring to page mapping flags at a point when we generally have no hold on the mapping at all; but I've never heard of that being a problem, and I guess if the mapping is stale, the page will very soon be freed, so no matter if it went on the Unevictable list for a moment.) Hugh ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 1/2] mm,mlock: drain pagevecs asynchronously 2012-01-01 7:30 ` [PATCH 1/2] mm,mlock: drain pagevecs asynchronously kosaki.motohiro 2012-01-04 1:17 ` Minchan Kim 2012-01-04 2:56 ` Hugh Dickins @ 2012-01-04 22:05 ` Andrew Morton 2012-01-04 23:33 ` KOSAKI Motohiro 2 siblings, 1 reply; 28+ messages in thread From: Andrew Morton @ 2012-01-04 22:05 UTC (permalink / raw) To: kosaki.motohiro Cc: linux-mm, linux-kernel, KOSAKI Motohiro, David Rientjes, Minchan Kim, Mel Gorman, Johannes Weiner On Sun, 1 Jan 2012 02:30:24 -0500 kosaki.motohiro@gmail.com wrote: > Because lru_add_drain_all() spent much time. Those LRU pagevecs are horrid things. They add high code and conceptual complexity, they add pointless uniprocessor overhead and the way in which they leave LRU pages floating around not on an LRU is rather maddening. So the best way to fix all of this as well as this problem we're observing is, I hope, to completely remove them. They've been in there for ~10 years and at the time they were quite beneficial in reducing lru_lock contention, hold times, acquisition frequency, etc. The approach to take here is to prepare the patches which eliminate lru_*_pvecs then identify the problems which occur as a result, via code inspection and runtime testing. Then fix those up. Many sites which take lru_lock are already batching the operation. It's a matter of hunting down those sites which take the lock once-per-page and, if they have high frequency, batch them up. Converting readahead to batch the locking will be pretty simple (read_pages(), mpage_readpages(), others). That will fix pagefaults too. rotate_reclaimable_page() can be batched by batching end_page_writeback(): a bio contains many pages already. deactivate_page() can be batched too - invalidate_mapping_pages() is already working on large chunks of pages. Those three cases are fairly simple - we just didn't try, because the lru_*_pvecs were there to do the work for us. 
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 1/2] mm,mlock: drain pagevecs asynchronously 2012-01-04 22:05 ` Andrew Morton @ 2012-01-04 23:33 ` KOSAKI Motohiro 2012-01-05 0:19 ` Hugh Dickins 0 siblings, 1 reply; 28+ messages in thread From: KOSAKI Motohiro @ 2012-01-04 23:33 UTC (permalink / raw) To: Andrew Morton Cc: linux-mm, linux-kernel, KOSAKI Motohiro, David Rientjes, Minchan Kim, Mel Gorman, Johannes Weiner (1/4/12 5:05 PM), Andrew Morton wrote: > On Sun, 1 Jan 2012 02:30:24 -0500 > kosaki.motohiro@gmail.com wrote: > >> Because lru_add_drain_all() spent much time. > > Those LRU pagevecs are horrid things. They add high code and > conceptual complexity, they add pointless uniprocessor overhead and the > way in which they leave LRU pages floating around not on an LRU is > rather maddening. > > So the best way to fix all of this as well as this problem we're > observing is, I hope, to completely remove them. > > They've been in there for ~10 years and at the time they were quite > beneficial in reducing lru_lock contention, hold times, acquisition > frequency, etc. > > The approach to take here is to prepare the patches which eliminate > lru_*_pvecs then identify the problems which occur as a result, via > code inspection and runtime testing. Then fix those up. > > Many sites which take lru_lock are already batching the operation. > It's a matter of hunting down those sites which take the lock > once-per-page and, if they have high frequency, batch them up. > > Converting readahead to batch the locking will be pretty simple > (read_pages(), mpage_readpages(), others). That will fix pagefaults > too. > > rotate_reclaimable_page() can be batched by batching > end_page_writeback(): a bio contains many pages already. > > deactivate_page() can be batched too - invalidate_mapping_pages() is > already working on large chunks of pages. > > Those three cases are fairly simple - we just didn't try, because the > lru_*_pvecs were there to do the work for us. got it. 
so, let's wait hugh's "mm: take pagevecs off reclaim stack" next spin and make the patches on top of it. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 1/2] mm,mlock: drain pagevecs asynchronously 2012-01-04 23:33 ` KOSAKI Motohiro @ 2012-01-05 0:19 ` Hugh Dickins 0 siblings, 0 replies; 28+ messages in thread From: Hugh Dickins @ 2012-01-05 0:19 UTC (permalink / raw) To: KOSAKI Motohiro Cc: Andrew Morton, linux-mm, linux-kernel, KOSAKI Motohiro, David Rientjes, Minchan Kim, Mel Gorman, Johannes Weiner On Wed, 4 Jan 2012, KOSAKI Motohiro wrote: > (1/4/12 5:05 PM), Andrew Morton wrote: > > On Sun, 1 Jan 2012 02:30:24 -0500 > > kosaki.motohiro@gmail.com wrote: > > > > > Because lru_add_drain_all() spent much time. > > > > Those LRU pagevecs are horrid things. They add high code and > > conceptual complexity, they add pointless uniprocessor overhead and the > > way in which they leave LRU pages floating around not on an LRU is > > rather maddening. Yes, we continue to have difficulties with not-quite-PageLRU-yet pages. > > > > So the best way to fix all of this as well as this problem we're > > observing is, I hope, to completely remove them. Nice aim, sounds like a dirty job. I wonder how far we could get using lru_add_drain, avoiding lru_add_drain_all, and flushing pvec when pre-empted. > ... > > got it. so, let's wait hugh's "mm: take pagevecs off reclaim stack" next spin > and make the patches on top of it. Don't wait on me, I wasn't intending another spin, with Andrew's last word on it today: > If we already have the latency problem at the isolate_lru_pages() stage > then I suppose we can assume that nobody is noticing it, so we'll > probably be OK. > > For a while. Someone will complain at some stage and we'll probably > end up busting this work into chunks. and mm-commits does presently have my mm-rearrange-putback-inactive-pages.patch in on top of it. Besides, these free_hot_cold_page_list() users are already avoiding the lru add pagevecs which Andrew is nudging towards removing above, so there shouldn't be much overlap. 
Or maybe you're thinking of my observation that it could avoid the !page_evictable putback_lru_page special case now: yes, I'd like to make that change sometime, but moved away to other things for now. Hugh ^ permalink raw reply [flat|nested] 28+ messages in thread
* [PATCH 2/2] sysvshm: SHM_LOCK use lru_add_drain_all_async() 2011-12-30 10:07 ` KOSAKI Motohiro 2012-01-01 7:30 ` [PATCH 1/2] mm,mlock: drain pagevecs asynchronously kosaki.motohiro @ 2012-01-01 7:30 ` kosaki.motohiro 2012-01-04 1:51 ` Hugh Dickins 2012-01-06 6:13 ` [PATCH] mm: do not drain pagevecs for mlock Tao Ma 2012-01-09 7:25 ` Tao Ma 3 siblings, 1 reply; 28+ messages in thread From: kosaki.motohiro @ 2012-01-01 7:30 UTC (permalink / raw) To: linux-mm, linux-kernel; +Cc: KOSAKI Motohiro From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> shmctl also doesn't need a synchronous pagevec drain. This patch replaces it with lru_add_drain_all_async(). Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> --- ipc/shm.c | 4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff --git a/ipc/shm.c b/ipc/shm.c index 02ecf2c..1eb25f0 100644 --- a/ipc/shm.c +++ b/ipc/shm.c @@ -872,8 +872,6 @@ SYSCALL_DEFINE3(shmctl, int, shmid, int, cmd, struct shmid_ds __user *, buf) { struct file *uninitialized_var(shm_file); - lru_add_drain_all(); /* drain pagevecs to lru lists */ - shp = shm_lock_check(ns, shmid); if (IS_ERR(shp)) { err = PTR_ERR(shp); @@ -911,6 +909,8 @@ SYSCALL_DEFINE3(shmctl, int, shmid, int, cmd, struct shmid_ds __user *, buf) shp->mlock_user = NULL; } shm_unlock(shp); + /* prevent user visible mismatch of unevictable accounting */ + lru_add_drain_all_async(); goto out; } case IPC_RMID: -- 1.7.1 ^ permalink raw reply related [flat|nested] 28+ messages in thread
* Re: [PATCH 2/2] sysvshm: SHM_LOCK use lru_add_drain_all_async() 2012-01-01 7:30 ` [PATCH 2/2] sysvshm: SHM_LOCK use lru_add_drain_all_async() kosaki.motohiro @ 2012-01-04 1:51 ` Hugh Dickins 2012-01-04 2:19 ` KOSAKI Motohiro 0 siblings, 1 reply; 28+ messages in thread From: Hugh Dickins @ 2012-01-04 1:51 UTC (permalink / raw) To: kosaki.motohiro; +Cc: linux-mm, linux-kernel, KOSAKI Motohiro On Sun, 1 Jan 2012, kosaki.motohiro@gmail.com wrote: > From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> > > shmctl also don't need synchrounous pagevec drain. This patch replace it with > lru_add_drain_all_async(). > > Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Let me answer this 2/2 first since it's easier. I'm going to thank you for bringing this lru_add_drain_all() to my attention, I had not noticed it; but Nak the patch itself. The reason being, that particular lru_add_drain_all() serves no useful purpose, so let's delete it instead of replacing it. I believe that it serves no purpose for SHM_LOCK and no purpose for SHM_UNLOCK. I'm dabbling in this area myself, since you so cogently pointed out that I'd tried to add a cond_resched() to scan_mapping_unevictable_pages() (which is a helper for SHM_UNLOCK here) while it's under spinlock. In testing my fix for that, I find that there has been no attempt to keep the Unevictable count accurate on SysVShm: SHM_LOCK pages get marked unevictable lazily later as memory pressure discovers them - which perhaps mirrors the way in which SHM_LOCK makes no attempt to instantiate pages, unlike mlock. Since nobody has complained about that in the two years since we've had an Unevictable count in /proc/meminfo, I don't see any need to add code (it would need more than just your change here; would need more even than calling scan_mapping_unevictable_pages() at SHM_LOCK time - though perhaps along with your 1/2 that could handle it) and overhead to satisfy a need that nobody has. 
I'll delete that lru_add_drain_all() in my patch, okay? (But in writing this, realize I still don't quite understand why the Unevictable count takes a second or two to get back to 0 after SHM_UNLOCK: perhaps I've more to discover.) Hugh > --- > ipc/shm.c | 4 ++-- > 1 files changed, 2 insertions(+), 2 deletions(-) > > diff --git a/ipc/shm.c b/ipc/shm.c > index 02ecf2c..1eb25f0 100644 > --- a/ipc/shm.c > +++ b/ipc/shm.c > @@ -872,8 +872,6 @@ SYSCALL_DEFINE3(shmctl, int, shmid, int, cmd, struct shmid_ds __user *, buf) > { > struct file *uninitialized_var(shm_file); > > - lru_add_drain_all(); /* drain pagevecs to lru lists */ > - > shp = shm_lock_check(ns, shmid); > if (IS_ERR(shp)) { > err = PTR_ERR(shp); > @@ -911,6 +909,8 @@ SYSCALL_DEFINE3(shmctl, int, shmid, int, cmd, struct shmid_ds __user *, buf) > shp->mlock_user = NULL; > } > shm_unlock(shp); > + /* prevent user visible mismatch of unevictable accounting */ > + lru_add_drain_all_async(); > goto out; > } > case IPC_RMID: > -- > 1.7.1 ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 2/2] sysvshm: SHM_LOCK use lru_add_drain_all_async() 2012-01-04 1:51 ` Hugh Dickins @ 2012-01-04 2:19 ` KOSAKI Motohiro 2012-01-04 5:17 ` Hugh Dickins 0 siblings, 1 reply; 28+ messages in thread From: KOSAKI Motohiro @ 2012-01-04 2:19 UTC (permalink / raw) To: Hugh Dickins; +Cc: linux-mm, linux-kernel, KOSAKI Motohiro (1/3/12 8:51 PM), Hugh Dickins wrote: > On Sun, 1 Jan 2012, kosaki.motohiro@gmail.com wrote: >> From: KOSAKI Motohiro<kosaki.motohiro@jp.fujitsu.com> >> >> shmctl also don't need synchrounous pagevec drain. This patch replace it with >> lru_add_drain_all_async(). >> >> Signed-off-by: KOSAKI Motohiro<kosaki.motohiro@jp.fujitsu.com> > > Let me answer this 2/2 first since it's easier. > > I'm going to thank you for bringing this lru_add_drain_all() > to my attention, I had not noticed it; but Nak the patch itself. > > The reason being, that particular lru_add_drain_all() serves no > useful purpose, so let's delete it instead of replacing it. I believe > that it serves no purpose for SHM_LOCK and no purpose for SHM_UNLOCK. > > I'm dabbling in this area myself, since you so cogently pointed out that > I'd tried to add a cond_resched() to scan_mapping_unevictable_pages() > (which is a helper for SHM_UNLOCK here) while it's under spinlock. > > In testing my fix for that, I find that there has been no attempt to > keep the Unevictable count accurate on SysVShm: SHM_LOCK pages get > marked unevictable lazily later as memory pressure discovers them - > which perhaps mirrors the way in which SHM_LOCK makes no attempt to > instantiate pages, unlike mlock. Ugh, you are right. I'm recovering my remember gradually. Lee implemented immediate lru off logic at first and I killed it to close a race. I completely forgot. So, yes, now SHM_LOCK has no attempt to instantiate pages. I'm ashamed. 
> > Since nobody has complained about that in the two years since we've > had an Unevictable count in /proc/meminfo, I don't see any need to > add code (it would need more than just your change here; would need > more even than calling scan_mapping_unevictable_pages() at SHM_LOCK > time - though perhaps along with your 1/2 that could handle it) and > overhead to satisfy a need that nobody has. > > I'll delete that lru_add_drain_all() in my patch, okay? Sure thing. :) > (But in writing this, realize I still don't quite understand why > the Unevictable count takes a second or two to get back to 0 after > SHM_UNLOCK: perhaps I've more to discover.) Interesting. I'm looking at this too. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 2/2] sysvshm: SHM_LOCK use lru_add_drain_all_async() 2012-01-04 2:19 ` KOSAKI Motohiro @ 2012-01-04 5:17 ` Hugh Dickins 2012-01-04 8:34 ` KOSAKI Motohiro 0 siblings, 1 reply; 28+ messages in thread From: Hugh Dickins @ 2012-01-04 5:17 UTC (permalink / raw) To: KOSAKI Motohiro Cc: linux-mm, linux-kernel, KOSAKI Motohiro, Michel Lespinasse On Tue, 3 Jan 2012, KOSAKI Motohiro wrote: > (1/3/12 8:51 PM), Hugh Dickins wrote: > > > > In testing my fix for that, I find that there has been no attempt to > > keep the Unevictable count accurate on SysVShm: SHM_LOCK pages get > > marked unevictable lazily later as memory pressure discovers them - > > which perhaps mirrors the way in which SHM_LOCK makes no attempt to > > instantiate pages, unlike mlock. > > Ugh, you are right. I'm recovering my remember gradually. Lee implemented > immediate lru off logic at first and I killed it > to close a race. I completely forgot. So, yes, now SHM_LOCK has no attempt to > instantiate pages. I'm ashamed. Why ashamed? The shmctl man-page documents "The caller must fault in any pages that are required to be present after locking is enabled." That's just how it behaves. > > (But in writing this, realize I still don't quite understand why > > the Unevictable count takes a second or two to get back to 0 after > > SHM_UNLOCK: perhaps I've more to discover.) > > Interesting. I'm looking at this too. In case you got distracted before you found it, mm/vmstat.c's static DEFINE_PER_CPU(struct delayed_work, vmstat_work); int sysctl_stat_interval __read_mostly = HZ; static void vmstat_update(struct work_struct *w) { refresh_cpu_vm_stats(smp_processor_id()); schedule_delayed_work(&__get_cpu_var(vmstat_work), round_jiffies_relative(sysctl_stat_interval)); } would be why, I think. And that implies to me that your lru_add_drain_all_async() is not necessary, you'd get just as good an effect, more cheaply, by doing a local lru_add_drain() before the refresh in vmstat_update(). 
But it would still require your changes to ____pagevec_lru_add_fn(), if those turn out to help more than they hurt. Hugh ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 2/2] sysvshm: SHM_LOCK use lru_add_drain_all_async() 2012-01-04 5:17 ` Hugh Dickins @ 2012-01-04 8:34 ` KOSAKI Motohiro 0 siblings, 0 replies; 28+ messages in thread From: KOSAKI Motohiro @ 2012-01-04 8:34 UTC (permalink / raw) To: Hugh Dickins; +Cc: linux-mm, linux-kernel, Michel Lespinasse 2012/1/4 Hugh Dickins <hughd@google.com>: > On Tue, 3 Jan 2012, KOSAKI Motohiro wrote: >> (1/3/12 8:51 PM), Hugh Dickins wrote: >> > >> > In testing my fix for that, I find that there has been no attempt to >> > keep the Unevictable count accurate on SysVShm: SHM_LOCK pages get >> > marked unevictable lazily later as memory pressure discovers them - >> > which perhaps mirrors the way in which SHM_LOCK makes no attempt to >> > instantiate pages, unlike mlock. >> >> Ugh, you are right. I'm recovering my remember gradually. Lee implemented >> immediate lru off logic at first and I killed it >> to close a race. I completely forgot. So, yes, now SHM_LOCK has no attempt to >> instantiate pages. I'm ashamed. > > Why ashamed? The shmctl man-page documents "The caller must fault in any > pages that are required to be present after locking is enabled." That's > just how it behaves. hehe, I have big bad reputation about for bad remember capabilities from my friends. I should have remembered what i implemented. ;-) >> > (But in writing this, realize I still don't quite understand why >> > the Unevictable count takes a second or two to get back to 0 after >> > SHM_UNLOCK: perhaps I've more to discover.) >> >> Interesting. I'm looking at this too. > > In case you got distracted before you found it, mm/vmstat.c's > > static DEFINE_PER_CPU(struct delayed_work, vmstat_work); > int sysctl_stat_interval __read_mostly = HZ; > > static void vmstat_update(struct work_struct *w) > { > refresh_cpu_vm_stats(smp_processor_id()); > schedule_delayed_work(&__get_cpu_var(vmstat_work), > round_jiffies_relative(sysctl_stat_interval)); > } > > would be why, I think. 
> And that implies to me that your
> lru_add_drain_all_async() is not necessary, you'd get just as good
> an effect, more cheaply, by doing a local lru_add_drain() before the
> refresh in vmstat_update().

When I implemented lru_add_drain_all_async(), I considered this idea. I don't dislike either approach. But if we take the vmstat_update() one, I think we need more tricks. pcp draining in refresh_cpu_vm_stats() is delayed by up to 3 seconds. Why? round_jiffies_relative() doesn't simply round to the HZ boundary. Instead, it adds a small unique offset for each cpu; thus, 3 seconds means that up to 3000 cpus don't contend on zone_{lru_}lock. pagevec draining would also need the same trick to rescue SGI UV. It might be too pessimistic a concern, but vmstat_update() shouldn't introduce observable lock contention.

> But it would still require your changes to ____pagevec_lru_add_fn(),
> if those turn out to help more than they hurt.

I agree. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH] mm: do not drain pagevecs for mlock 2011-12-30 10:07 ` KOSAKI Motohiro 2012-01-01 7:30 ` [PATCH 1/2] mm,mlock: drain pagevecs asynchronously kosaki.motohiro 2012-01-01 7:30 ` [PATCH 2/2] sysvshm: SHM_LOCK use lru_add_drain_all_async() kosaki.motohiro @ 2012-01-06 6:13 ` Tao Ma 2012-01-06 6:18 ` KOSAKI Motohiro 2012-01-09 7:25 ` Tao Ma 3 siblings, 1 reply; 28+ messages in thread From: Tao Ma @ 2012-01-06 6:13 UTC (permalink / raw) To: KOSAKI Motohiro Cc: linux-mm, linux-kernel, David Rientjes, Minchan Kim, Mel Gorman, Johannes Weiner, Andrew Morton Hi Kosaki, On 12/30/2011 06:07 PM, KOSAKI Motohiro wrote: >>> Because your test program is too artificial. 20sec/100000times = >>> 200usec. And your >>> program repeat mlock and munlock the exact same address. so, yes, if >>> lru_add_drain_all() is removed, it become near no-op. but it's >>> worthless comparision. >>> none of any practical program does such strange mlock usage. >> yes, I should say it is artificial. But mlock did cause the problem in >> our product system and perf shows that the mlock uses the system time >> much more than others. That's the reason we created this program to test >> whether mlock really sucks. And we compared the result with >> rhel5(2.6.18) which runs much much faster. >> >> And from the commit log you described, we can remove lru_add_drain_all >> safely here, so why add it? At least removing it makes mlock much faster >> compared to the vanilla kernel. > > If we remove it, we lose to a test way of mlock. "Memlocked" field of > /proc/meminfo > show inaccurate number very easily. So, if 200usec is no avoidable, > I'll ack you. > But I'm not convinced yet. Do you find something new for this? Thanks Tao ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH] mm: do not drain pagevecs for mlock 2012-01-06 6:13 ` [PATCH] mm: do not drain pagevecs for mlock Tao Ma @ 2012-01-06 6:18 ` KOSAKI Motohiro 2012-01-06 6:30 ` Tao Ma 0 siblings, 1 reply; 28+ messages in thread From: KOSAKI Motohiro @ 2012-01-06 6:18 UTC (permalink / raw) To: Tao Ma Cc: linux-mm, linux-kernel, David Rientjes, Minchan Kim, Mel Gorman, Johannes Weiner, Andrew Morton 2012/1/6 Tao Ma <tm@tao.ma>: > Hi Kosaki, > On 12/30/2011 06:07 PM, KOSAKI Motohiro wrote: >>>> Because your test program is too artificial. 20sec/100000times = >>>> 200usec. And your >>>> program repeat mlock and munlock the exact same address. so, yes, if >>>> lru_add_drain_all() is removed, it become near no-op. but it's >>>> worthless comparision. >>>> none of any practical program does such strange mlock usage. >>> yes, I should say it is artificial. But mlock did cause the problem in >>> our product system and perf shows that the mlock uses the system time >>> much more than others. That's the reason we created this program to test >>> whether mlock really sucks. And we compared the result with >>> rhel5(2.6.18) which runs much much faster. >>> >>> And from the commit log you described, we can remove lru_add_drain_all >>> safely here, so why add it? At least removing it makes mlock much faster >>> compared to the vanilla kernel. >> >> If we remove it, we lose to a test way of mlock. "Memlocked" field of >> /proc/meminfo >> show inaccurate number very easily. So, if 200usec is no avoidable, >> I'll ack you. >> But I'm not convinced yet. > Do you find something new for this? No. Or more exactly, 200usec is my calculation mistake. your program call mlock 3 times per each iteration. so, correct cost is 66usec. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH] mm: do not drain pagevecs for mlock 2012-01-06 6:18 ` KOSAKI Motohiro @ 2012-01-06 6:30 ` Tao Ma 2012-01-06 6:33 ` KOSAKI Motohiro 0 siblings, 1 reply; 28+ messages in thread From: Tao Ma @ 2012-01-06 6:30 UTC (permalink / raw) To: KOSAKI Motohiro Cc: linux-mm, linux-kernel, David Rientjes, Minchan Kim, Mel Gorman, Johannes Weiner, Andrew Morton On 01/06/2012 02:18 PM, KOSAKI Motohiro wrote: > 2012/1/6 Tao Ma <tm@tao.ma>: >> Hi Kosaki, >> On 12/30/2011 06:07 PM, KOSAKI Motohiro wrote: >>>>> Because your test program is too artificial. 20sec/100000times = >>>>> 200usec. And your >>>>> program repeat mlock and munlock the exact same address. so, yes, if >>>>> lru_add_drain_all() is removed, it become near no-op. but it's >>>>> worthless comparision. >>>>> none of any practical program does such strange mlock usage. >>>> yes, I should say it is artificial. But mlock did cause the problem in >>>> our product system and perf shows that the mlock uses the system time >>>> much more than others. That's the reason we created this program to test >>>> whether mlock really sucks. And we compared the result with >>>> rhel5(2.6.18) which runs much much faster. >>>> >>>> And from the commit log you described, we can remove lru_add_drain_all >>>> safely here, so why add it? At least removing it makes mlock much faster >>>> compared to the vanilla kernel. >>> >>> If we remove it, we lose to a test way of mlock. "Memlocked" field of >>> /proc/meminfo >>> show inaccurate number very easily. So, if 200usec is no avoidable, >>> I'll ack you. >>> But I'm not convinced yet. >> Do you find something new for this? > > No. > > Or more exactly, 200usec is my calculation mistake. your program call mlock > 3 times per each iteration. so, correct cost is 66usec. yes, so mlock can do 15000/s, it is even slower than the whole i/o time for some not very fast ssd disk and I don't think it is endurable. I guess we should remove it, right? 
Or do you have any other suggestion that I can try? Thanks Tao ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH] mm: do not drain pagevecs for mlock 2012-01-06 6:30 ` Tao Ma @ 2012-01-06 6:33 ` KOSAKI Motohiro 2012-01-06 6:46 ` Tao Ma 0 siblings, 1 reply; 28+ messages in thread From: KOSAKI Motohiro @ 2012-01-06 6:33 UTC (permalink / raw) To: Tao Ma Cc: KOSAKI Motohiro, linux-mm, linux-kernel, David Rientjes, Minchan Kim, Mel Gorman, Johannes Weiner, Andrew Morton (1/6/12 1:30 AM), Tao Ma wrote: > On 01/06/2012 02:18 PM, KOSAKI Motohiro wrote: >> 2012/1/6 Tao Ma<tm@tao.ma>: >>> Hi Kosaki, >>> On 12/30/2011 06:07 PM, KOSAKI Motohiro wrote: >>>>>> Because your test program is too artificial. 20sec/100000times = >>>>>> 200usec. And your >>>>>> program repeat mlock and munlock the exact same address. so, yes, if >>>>>> lru_add_drain_all() is removed, it become near no-op. but it's >>>>>> worthless comparision. >>>>>> none of any practical program does such strange mlock usage. >>>>> yes, I should say it is artificial. But mlock did cause the problem in >>>>> our product system and perf shows that the mlock uses the system time >>>>> much more than others. That's the reason we created this program to test >>>>> whether mlock really sucks. And we compared the result with >>>>> rhel5(2.6.18) which runs much much faster. >>>>> >>>>> And from the commit log you described, we can remove lru_add_drain_all >>>>> safely here, so why add it? At least removing it makes mlock much faster >>>>> compared to the vanilla kernel. >>>> >>>> If we remove it, we lose to a test way of mlock. "Memlocked" field of >>>> /proc/meminfo >>>> show inaccurate number very easily. So, if 200usec is no avoidable, >>>> I'll ack you. >>>> But I'm not convinced yet. >>> Do you find something new for this? >> >> No. >> >> Or more exactly, 200usec is my calculation mistake. your program call mlock >> 3 times per each iteration. so, correct cost is 66usec. > yes, so mlock can do 15000/s, it is even slower than the whole i/o time > for some not very fast ssd disk and I don't think it is endurable. 
I > guess we should remove it, right? Or you have another other suggestion > that I can try for it? read whole thread. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH] mm: do not drain pagevecs for mlock 2012-01-06 6:33 ` KOSAKI Motohiro @ 2012-01-06 6:46 ` Tao Ma 2012-01-09 23:58 ` KOSAKI Motohiro 0 siblings, 1 reply; 28+ messages in thread From: Tao Ma @ 2012-01-06 6:46 UTC (permalink / raw) To: KOSAKI Motohiro Cc: KOSAKI Motohiro, linux-mm, linux-kernel, David Rientjes, Minchan Kim, Mel Gorman, Johannes Weiner, Andrew Morton On 01/06/2012 02:33 PM, KOSAKI Motohiro wrote: > (1/6/12 1:30 AM), Tao Ma wrote: >> On 01/06/2012 02:18 PM, KOSAKI Motohiro wrote: >>> 2012/1/6 Tao Ma<tm@tao.ma>: >>>> Hi Kosaki, >>>> On 12/30/2011 06:07 PM, KOSAKI Motohiro wrote: >>>>>>> Because your test program is too artificial. 20sec/100000times = >>>>>>> 200usec. And your >>>>>>> program repeat mlock and munlock the exact same address. so, yes, if >>>>>>> lru_add_drain_all() is removed, it become near no-op. but it's >>>>>>> worthless comparision. >>>>>>> none of any practical program does such strange mlock usage. >>>>>> yes, I should say it is artificial. But mlock did cause the >>>>>> problem in >>>>>> our product system and perf shows that the mlock uses the system time >>>>>> much more than others. That's the reason we created this program >>>>>> to test >>>>>> whether mlock really sucks. And we compared the result with >>>>>> rhel5(2.6.18) which runs much much faster. >>>>>> >>>>>> And from the commit log you described, we can remove >>>>>> lru_add_drain_all >>>>>> safely here, so why add it? At least removing it makes mlock much >>>>>> faster >>>>>> compared to the vanilla kernel. >>>>> >>>>> If we remove it, we lose to a test way of mlock. "Memlocked" field of >>>>> /proc/meminfo >>>>> show inaccurate number very easily. So, if 200usec is no avoidable, >>>>> I'll ack you. >>>>> But I'm not convinced yet. >>>> Do you find something new for this? >>> >>> No. >>> >>> Or more exactly, 200usec is my calculation mistake. your program call >>> mlock >>> 3 times per each iteration. so, correct cost is 66usec. 
>> yes, so mlock can do 15000/s, it is even slower than the whole i/o time >> for some not very fast ssd disk and I don't think it is endurable. I >> guess we should remove it, right? Or you have another other suggestion >> that I can try for it? > > read whole thread. I have read the whole thread, and all you said is that the test case is artificial; there is no suggestion or patch for how to resolve the problem. As I said, the call is very time-consuming, the penalty grows with the number of CPU cores, and an I/O to an SSD can complete faster than it. So do you think 66 usec is acceptable for a memory operation? Thanks Tao ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH] mm: do not drain pagevecs for mlock 2012-01-06 6:46 ` Tao Ma @ 2012-01-09 23:58 ` KOSAKI Motohiro 2012-01-10 2:08 ` Tao Ma 0 siblings, 1 reply; 28+ messages in thread From: KOSAKI Motohiro @ 2012-01-09 23:58 UTC (permalink / raw) To: Tao Ma Cc: KOSAKI Motohiro, linux-mm, linux-kernel, David Rientjes, Minchan Kim, Mel Gorman, Johannes Weiner, Andrew Morton (1/6/12 1:46 AM), Tao Ma wrote: > On 01/06/2012 02:33 PM, KOSAKI Motohiro wrote: >> (1/6/12 1:30 AM), Tao Ma wrote: >>> On 01/06/2012 02:18 PM, KOSAKI Motohiro wrote: >>>> 2012/1/6 Tao Ma<tm@tao.ma>: >>>>> Hi Kosaki, >>>>> On 12/30/2011 06:07 PM, KOSAKI Motohiro wrote: >>>>>>>> Because your test program is too artificial. 20sec/100000times = >>>>>>>> 200usec. And your >>>>>>>> program repeat mlock and munlock the exact same address. so, yes, if >>>>>>>> lru_add_drain_all() is removed, it become near no-op. but it's >>>>>>>> worthless comparision. >>>>>>>> none of any practical program does such strange mlock usage. >>>>>>> yes, I should say it is artificial. But mlock did cause the >>>>>>> problem in >>>>>>> our product system and perf shows that the mlock uses the system time >>>>>>> much more than others. That's the reason we created this program >>>>>>> to test >>>>>>> whether mlock really sucks. And we compared the result with >>>>>>> rhel5(2.6.18) which runs much much faster. >>>>>>> >>>>>>> And from the commit log you described, we can remove >>>>>>> lru_add_drain_all >>>>>>> safely here, so why add it? At least removing it makes mlock much >>>>>>> faster >>>>>>> compared to the vanilla kernel. >>>>>> >>>>>> If we remove it, we lose to a test way of mlock. "Memlocked" field of >>>>>> /proc/meminfo >>>>>> show inaccurate number very easily. So, if 200usec is no avoidable, >>>>>> I'll ack you. >>>>>> But I'm not convinced yet. >>>>> Do you find something new for this? >>>> >>>> No. >>>> >>>> Or more exactly, 200usec is my calculation mistake. 
your program call >>>> mlock >>>> 3 times per each iteration. so, correct cost is 66usec. >>> yes, so mlock can do 15000/s, it is even slower than the whole i/o time >>> for some not very fast ssd disk and I don't think it is endurable. I >>> guess we should remove it, right? Or you have another other suggestion >>> that I can try for it? >> >> read whole thread. > I have read the whole thread, and you just described that the test case > is artificial and there is no suggestion or patch about how to resolve > it. As I have said that it is very time-consuming and with more cpu > cores, the more penalty, and an i/o time for a ssd can be faster than > it. So do you think 66 usec is OK for a memory operation? I don't think you've read the thread at all. Please read akpm's comment: http://www.spinics.net/lists/linux-mm/msg28290.html ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH] mm: do not drain pagevecs for mlock 2012-01-09 23:58 ` KOSAKI Motohiro @ 2012-01-10 2:08 ` Tao Ma 0 siblings, 0 replies; 28+ messages in thread From: Tao Ma @ 2012-01-10 2:08 UTC (permalink / raw) To: KOSAKI Motohiro Cc: KOSAKI Motohiro, linux-mm, linux-kernel, David Rientjes, Minchan Kim, Mel Gorman, Johannes Weiner, Andrew Morton On 01/10/2012 07:58 AM, KOSAKI Motohiro wrote: > (1/6/12 1:46 AM), Tao Ma wrote: >> On 01/06/2012 02:33 PM, KOSAKI Motohiro wrote: >>> (1/6/12 1:30 AM), Tao Ma wrote: >>>> On 01/06/2012 02:18 PM, KOSAKI Motohiro wrote: >>>>> 2012/1/6 Tao Ma<tm@tao.ma>: >>>>>> Hi Kosaki, >>>>>> On 12/30/2011 06:07 PM, KOSAKI Motohiro wrote: >>>>>>>>> Because your test program is too artificial. 20sec/100000times = >>>>>>>>> 200usec. And your >>>>>>>>> program repeat mlock and munlock the exact same address. so, >>>>>>>>> yes, if >>>>>>>>> lru_add_drain_all() is removed, it become near no-op. but it's >>>>>>>>> worthless comparision. >>>>>>>>> none of any practical program does such strange mlock usage. >>>>>>>> yes, I should say it is artificial. But mlock did cause the >>>>>>>> problem in >>>>>>>> our product system and perf shows that the mlock uses the system >>>>>>>> time >>>>>>>> much more than others. That's the reason we created this program >>>>>>>> to test >>>>>>>> whether mlock really sucks. And we compared the result with >>>>>>>> rhel5(2.6.18) which runs much much faster. >>>>>>>> >>>>>>>> And from the commit log you described, we can remove >>>>>>>> lru_add_drain_all >>>>>>>> safely here, so why add it? At least removing it makes mlock much >>>>>>>> faster >>>>>>>> compared to the vanilla kernel. >>>>>>> >>>>>>> If we remove it, we lose to a test way of mlock. "Memlocked" >>>>>>> field of >>>>>>> /proc/meminfo >>>>>>> show inaccurate number very easily. So, if 200usec is no avoidable, >>>>>>> I'll ack you. >>>>>>> But I'm not convinced yet. >>>>>> Do you find something new for this? >>>>> >>>>> No. 
>>>>> >>>>> Or more exactly, 200usec is my calculation mistake. your program call >>>>> mlock >>>>> 3 times per each iteration. so, correct cost is 66usec. >>>> yes, so mlock can do 15000/s, it is even slower than the whole i/o time >>>> for some not very fast ssd disk and I don't think it is endurable. I >>>> guess we should remove it, right? Or you have another other suggestion >>>> that I can try for it? >>> >>> read whole thread. >> I have read the whole thread, and you just described that the test case >> is artificial and there is no suggestion or patch about how to resolve >> it. As I have said that it is very time-consuming and with more cpu >> cores, the more penalty, and an i/o time for a ssd can be faster than >> it. So do you think 66 usec is OK for a memory operation? > > I don't think you've read the thread at all. please read akpm's commnet. > > http://www.spinics.net/lists/linux-mm/msg28290.html Oh, your patch set doesn't cc to me, so my mail filter moved it to another directory.. Sorry and I will read the whole thread. Thanks again for your time. Thanks Tao ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH] mm: do not drain pagevecs for mlock 2011-12-30 10:07 ` KOSAKI Motohiro ` (2 preceding siblings ...) 2012-01-06 6:13 ` [PATCH] mm: do not drain pagevecs for mlock Tao Ma @ 2012-01-09 7:25 ` Tao Ma 3 siblings, 0 replies; 28+ messages in thread From: Tao Ma @ 2012-01-09 7:25 UTC (permalink / raw) To: KOSAKI Motohiro Cc: linux-mm, linux-kernel, David Rientjes, Minchan Kim, Mel Gorman, Johannes Weiner, Andrew Morton Hi KOSAKI, On 12/30/2011 06:07 PM, KOSAKI Motohiro wrote: >>> Because your test program is too artificial. 20sec/100000times = >>> 200usec. And your >>> program repeat mlock and munlock the exact same address. so, yes, if >>> lru_add_drain_all() is removed, it become near no-op. but it's >>> worthless comparision. >>> none of any practical program does such strange mlock usage. >> yes, I should say it is artificial. But mlock did cause the problem in >> our product system and perf shows that the mlock uses the system time >> much more than others. That's the reason we created this program to test >> whether mlock really sucks. And we compared the result with >> rhel5(2.6.18) which runs much much faster. >> >> And from the commit log you described, we can remove lru_add_drain_all >> safely here, so why add it? At least removing it makes mlock much faster >> compared to the vanilla kernel. > > If we remove it, we lose to a test way of mlock. "Memlocked" field of > /proc/meminfo > show inaccurate number very easily. So, if 200usec is no avoidable, > I'll ack you. > But I'm not convinced yet. Since you don't think removing lru_add_drain_all is appropriate, I have created another patch set to resolve this. It adds a new per-cpu counter that records the number of pages sitting in that CPU's pagevecs; if the counter is 0, we skip draining the corresponding cpu. Does that make sense to you? Thanks Tao ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH] mm: do not drain pagevecs for mlock 2011-12-30 9:45 ` Tao Ma 2011-12-30 10:07 ` KOSAKI Motohiro @ 2011-12-30 10:14 ` KOSAKI Motohiro 1 sibling, 0 replies; 28+ messages in thread From: KOSAKI Motohiro @ 2011-12-30 10:14 UTC (permalink / raw) To: Tao Ma Cc: linux-mm, linux-kernel, David Rientjes, Minchan Kim, Mel Gorman, Johannes Weiner, Andrew Morton One more thing. >> Because your test program is too artificial. 20sec/100000times = >> 200usec. And your >> program repeat mlock and munlock the exact same address. so, yes, if >> lru_add_drain_all() is removed, it become near no-op. but it's >> worthless comparision. >> none of any practical program does such strange mlock usage. > yes, I should say it is artificial. But mlock did cause the problem in > our product system and perf shows that the mlock uses the system time > much more than others. That's the reason we created this program to test > whether mlock really sucks. And we compared the result with > rhel5(2.6.18) which runs much much faster. rhel5 is faster because it doesn't have a proper mlock implementation, and consequently it has a lot of mlock-related performance issues. We optimized the PAGE_MLOCK handling for the typical workload: mlock calls are rare, while memory reclaim happens a lot. That's the reason why we haven't had any complaints about the mlock path. ^ permalink raw reply [flat|nested] 28+ messages in thread
end of thread, other threads:[~2012-01-10 8:53 UTC | newest] Thread overview: 28+ messages -- 2011-12-30 6:36 [PATCH] mm: do not drain pagevecs for mlock Tao Ma 2011-12-30 8:11 ` KOSAKI Motohiro 2011-12-30 8:48 ` Tao Ma 2011-12-30 9:31 ` KOSAKI Motohiro 2011-12-30 9:45 ` Tao Ma 2011-12-30 10:07 ` KOSAKI Motohiro 2012-01-01 7:30 ` [PATCH 1/2] mm,mlock: drain pagevecs asynchronously kosaki.motohiro 2012-01-04 1:17 ` Minchan Kim 2012-01-04 2:38 ` KOSAKI Motohiro 2012-01-10 8:53 ` Tao Ma 2012-01-04 2:56 ` Hugh Dickins 2012-01-04 22:05 ` Andrew Morton 2012-01-04 23:33 ` KOSAKI Motohiro 2012-01-05 0:19 ` Hugh Dickins 2012-01-01 7:30 ` [PATCH 2/2] sysvshm: SHM_LOCK use lru_add_drain_all_async() kosaki.motohiro 2012-01-04 1:51 ` Hugh Dickins 2012-01-04 2:19 ` KOSAKI Motohiro 2012-01-04 5:17 ` Hugh Dickins 2012-01-04 8:34 ` KOSAKI Motohiro 2012-01-06 6:13 ` [PATCH] mm: do not drain pagevecs for mlock Tao Ma 2012-01-06 6:18 ` KOSAKI Motohiro 2012-01-06 6:30 ` Tao Ma 2012-01-06 6:33 ` KOSAKI Motohiro 2012-01-06 6:46 ` Tao Ma 2012-01-09 23:58 ` KOSAKI Motohiro 2012-01-10 2:08 ` Tao Ma 2012-01-09 7:25 ` Tao Ma 2011-12-30 10:14 ` KOSAKI Motohiro