From: Lin Feng
Date: Mon, 18 Feb 2013 18:34:44 +0800
To: Mel Gorman
Cc: Andrew Morton, bcrl@kvack.org, viro@zeniv.linux.org.uk,
    khlebnikov@openvz.org, walken@google.com, kamezawa.hiroyu@jp.fujitsu.com,
    minchan@kernel.org, riel@redhat.com, rientjes@google.com,
    isimatu.yasuaki@jp.fujitsu.com, wency@cn.fujitsu.com, laijs@cn.fujitsu.com,
    jiang.liu@huawei.com, linux-mm@kvack.org, linux-aio@kvack.org,
    linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 1/2] mm: hotplug: implement non-movable version of
 get_user_pages() called get_user_pages_non_movable()
Message-ID: <512203C4.8010608@cn.fujitsu.com>
In-Reply-To: <20130205115722.GF21389@suse.de>
References: <1359972248-8722-1-git-send-email-linfeng@cn.fujitsu.com>
 <1359972248-8722-2-git-send-email-linfeng@cn.fujitsu.com>
 <20130204160624.5c20a8a0.akpm@linux-foundation.org>
 <20130205115722.GF21389@suse.de>

Hi Mel,

See below.

On 02/05/2013 07:57 PM, Mel Gorman wrote:
> On Mon, Feb 04, 2013 at 04:06:24PM -0800, Andrew Morton wrote:
>> The ifdefs aren't really needed here and I encourage people to omit
>> them. This keeps the header files looking neater and reduces the
>> chances of things later breaking because we forgot to update some
>> CONFIG_foo logic in a header file. The downside is that errors will be
>> revealed at link time rather than at compile time, but that's a pretty
>> small cost.
>>
>
> As an aside, if ifdefs *have* to be used then it often better to have a
> pattern like
>
> #ifdef CONFIG_MEMORY_HOTREMOVE
> int get_user_pages_non_movable(struct task_struct *tsk, struct mm_struct *mm,
> 		unsigned long start, int nr_pages, int write, int force,
> 		struct page **pages, struct vm_area_struct **vmas);
> #else
> static inline get_user_pages_non_movable(...)
> {
> 	get_user_pages(...)
> }
> #endif
>
> It eliminates the need for #ifdefs in the C file that calls
> get_user_pages_non_movable().

Thanks for pointing that out :)

>>> +
>>> +retry:
>>> +	ret = get_user_pages(tsk, mm, start, nr_pages, write, force, pages,
>>> +				vmas);
>>
>> We should handle (ret < 0) here. At present the function will silently
>> convert an error return into "return 0", which is bad. The function
>> does appear to otherwise do the right thing if get_user_pages() failed,
>> but only due to good luck.
>>
>
> A BUG_ON check if we retry more than once wouldn't hurt either. It requires
> a broken implementation of alloc_migrate_target() but someone might "fix"
> it and miss this.

Agreed.
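Something like the following is what I have in mind for the next version
(an untested sketch; "retried" is a new local counter initialised to 0,
and the name is just a placeholder):

retry:
	ret = get_user_pages(tsk, mm, start, nr_pages, write, force, pages,
				vmas);
	/* Don't silently turn a get_user_pages() error into "0 pages pinned" */
	if (ret < 0)
		return ret;

	/*
	 * A working alloc_migrate_target() never hands back a ZONE_MOVABLE
	 * page, so we should never get here more than twice.
	 */
	BUG_ON(retried > 1);
	retried++;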
>
> A minor aside, it would be nice if we exited at this point if there was
> no populated ZONE_MOVABLE in the system. There is a movable_zone global
> variable already. If it was forced to be MAX_NR_ZONES before the call to
> find_usable_zone_for_movable() in memory initialisation we should be able
> to make a cheap check for it here.

OK.

>>> +		if (!migrate_pre_flag) {
>>> +			if (migrate_prep())
>>> +				goto put_page;
>
> CONFIG_MEMORY_HOTREMOVE depends on CONFIG_MIGRATION so this will never
> return failure. You could make this BUG_ON(migrate_prep()), remove this
> goto and the migrate_pre_flag check below becomes redundant.

Sorry, I don't quite follow this. Without migrate_pre_flag, the BUG_ON()
check would call migrate_prep() every time we isolate a single page; do we
really have to do that? I mean that for a group of pages it is sufficient
to call migrate_prep() only once, and some other migration-related code
uses the same scheme. So I am confused: am I missing something, or is the
performance cost simply negligible? :(

>
>>> +			migrate_pre_flag = 1;
>>> +		}
>>> +
>>> +		if (!isolate_lru_page(pages[i])) {
>>> +			inc_zone_page_state(pages[i], NR_ISOLATED_ANON +
>>> +					page_is_file_cache(pages[i]));
>>> +			list_add_tail(&pages[i]->lru, &pagelist);
>>> +		} else {
>>> +			isolate_err = 1;
>>> +			goto put_page;
>>> +		}
>
> isolate_lru_page() takes the LRU lock every time. If
> get_user_pages_non_movable is used heavily then you may encounter lock
> contention problems. Batching this lock would be a separate patch and should
> not be necessary yet but you should at least comment on it as a reminder.

That is farsighted; it should be improved in the future.

>
> I think that a goto could also have been avoided here if you used break. The
> i == ret check below would be false and it would just fall through.
> Overall the flow would be a bit easier to follow.

Yes, I noticed this before. I thought the goto could save a few micro-ops
(here it only skips the i == ret check) at some cost in readability while
still being reasonably clear. There are other places in the current kernel
that face the same performance-versus-readability trade-off, and the
developers seem to prefer readability. Why is that? I would have thought
performance was critical for the kernel...

>
> Why list_add_tail()? I don't think it's wrong but it's unusual to see
> list_add_tail() when list_add() is enough.

Sorry, that was not intentional; I just copied it from other code without
thinking it through ;-)

>
>>> +		}
>>> +	}
>>> +
>>> +	/* All pages are non movable, we are done :) */
>>> +	if (i == ret && list_empty(&pagelist))
>>> +		return ret;
>>> +
>>> +put_page:
>>> +	/* Undo the effects of former get_user_pages(), we won't pin anything */
>>> +	for (i = 0; i < ret; i++)
>>> +		put_page(pages[i]);
>>> +
>
> release_pages.
>
> That comment is insufficient. There are non-obvious consequences to this
> logic. We are dropping pins on all pages regardless of what zone they
> are in. If the subsequent migration fails then we end up returning 0
> with no pages pinned. The user-visible effect is that io_setup() fails
> for non-obvious reasons. It will return EAGAIN to userspace which will be
> interpreted as "The specified nr_events exceeds the user's limit of available
> events.". The application will either fail or potentially infinite loop
> if the developer interpreted EAGAIN as "try again" as opposed to "this is
> a permanent failure".

Yes, I had missed those consequences.
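For that cleanup path I will switch to the batched helper as you suggest;
a minimal untested sketch (the failure semantics themselves will change
per your comments below):

put_page:
	/* Drop the references taken by get_user_pages() in one batch */
	release_pages(pages, ret, 0);	/* 0: the pages are not cold */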
>
> Is that deliberate? Is it really preferable that AIO can fail to setup
> and the application exit just in case we want to hot-remove memory later?
> Should a failed migration generate a WARN_ON at least?
>
> I would think that it's better to WARN_ON_ONCE if migration fails but pin
> the pages as requested. If a future memory hot-remove operation fails
> then the warning will indicate why but applications will not fail as a
> result.

I fully agree. I will keep the pages pinned and issue a WARN_ON_ONCE in
that case.

> It's a little clumsy but the memory hot-remove failure message
> could list what applications have pinned the pages that cannot be removed
> so the administrator has the option of force-killing the application. It
> is possible to discover what application is pinning a page from userspace
> but it would involve an expensive search with /proc/kpagemap
>
>>> +	if (migrate_pre_flag && !isolate_err) {
>>> +		ret = migrate_pages(&pagelist, alloc_migrate_target, 1,
>>> +					false, MIGRATE_SYNC, MR_SYSCALL);
>
> The conversion of alloc_migrate_target is a bit problematic. It strips
> the __GFP_MOVABLE flag and the consequence of this is that it converts
> those allocation requests to MIGRATE_UNMOVABLE. This potentially is a large
> number of pages, particularly if the number of get_user_pages_non_movable()
> increases for short-lived pins like direct IO.

Sorry, I don't quite understand this part either. If we use the new
migration allocation function you describe below, won't an increasing
number of get_user_pages_non_movable() calls also lead to a large number
of MIGRATE_UNMOVABLE pages? What is the difference; am I missing
something?

>
> One way around this is to add a high_zoneidx parameter to
> __alloc_pages_nodemask and rename it ____alloc_pages_nodemask, create a new
> inline function __alloc_pages_nodemask that passes in gfp_zone(gfp_mask)
> as the high_zoneidx and create a new migration allocation function that
> passes on ZONE_HIGHMEM as high_zoneidx. That would force the allocation to
> happen in a lower zone while still treating the allocation as MIGRATE_MOVABLE

On 02/05/2013 09:32 PM, Mel Gorman wrote:
> Credit to Michal Hocko for bringing this up but with the number of
> other issues I missed that this is also broken with respect to huge page
> handling. hugetlbfs pages will not be on the LRU so the isolation will mess
> up and the migration has to be handled differently. Ordinarily hugetlbfs
> pages cannot be allocated from ZONE_MOVABLE but it is possible to configure
> it to be allowed via /proc/sys/vm/hugepages_treat_as_movable. If this
> encounters a hugetlbfs page, it'll just blow up.
>
> The other is that this almost certainly broken for transhuge page
> handling. gup returns the head and tail pages and ordinarily this is ok
> because the caller only cares about the physical address. Migration will
> also split a hugepage if it receives it but you are potentially adding
> tail pages to a list here and then migrating them. The split of the first
> page will get very confused. I'm not exactly sure what the result will be
> but it won't be pretty.
>
> Was THP enabled when this was tested? Was CONFIG_DEBUG_LIST enabled
> during testing?

Sorry, I had not considered THP or hugetlbfs pages yet. I will handle those
cases based on your comments in a later version.
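As a stopgap I am thinking of detecting such pages in the isolation loop
and bailing out rather than corrupting the list (an untested sketch; proper
hugetlbfs/THP migration will need separate handling):

		/*
		 * hugetlbfs pages are not on the LRU and THP tail pages
		 * must not be put on the migration list, so give up on
		 * isolating this range instead of blowing up.
		 */
		if (PageHuge(pages[i]) || PageTransCompound(pages[i])) {
			isolate_err = 1;
			break;
		}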
thanks a lot for your review :)

linfeng