From: Anshuman Khandual
Subject: Re: [RFC PATCH 1/3] mm, numa: rework do_pages_move
To: Michal Hocko, Anshuman Khandual
Cc: linux-mm@kvack.org, Zi Yan, Naoya Horiguchi, "Kirill A. Shutemov",
 Vlastimil Babka, Andrew Morton, Andrea Reale, LKML
Date: Wed, 3 Jan 2018 15:06:49 +0530
Message-Id: <32bec0c9-60e2-0362-9446-feb4de1b119c@linux.vnet.ibm.com>
In-Reply-To: <20180103085804.GA11319@dhcp22.suse.cz>
References: <20171207143401.GK20234@dhcp22.suse.cz>
 <20171208161559.27313-1-mhocko@kernel.org>
 <20171208161559.27313-2-mhocko@kernel.org>
 <7dd106bd-460a-73a7-bae8-17ffe66a69ee@linux.vnet.ibm.com>
 <20180103085804.GA11319@dhcp22.suse.cz>

On 01/03/2018 02:28 PM, Michal Hocko wrote:
> On Wed 03-01-18 14:12:17, Anshuman Khandual wrote:
>> On 12/08/2017 09:45 PM, Michal Hocko wrote:
> [...]

[...]

>>
>> This reuses the existing page allocation helper from the migrate_pages()
>> system call. But all these allocator helper names for the migrate_pages()
>> function are really confusing. Even in this case, alloc_new_node_page()
>> and the original new_node_page(), which is still getting used in
>> do_migrate_range(), sound similar even though their implementations are
>> quite different. IMHO either all of them should be moved into a header
>> file with properly differentiated names, or they can stay in their
>> respective files with these generic names and be cleaned up later.
>
> I believe I've tried that but I couldn't move them into a single header
> file easily because of header file dependencies. I agree that their
> names are quite confusing. Feel free to send a patch to clean this up.

Sure. Will try once this one gets into mmotm.

[...]

>>
>> Just a nit. new_page_node() and store_status() seem different. Then why
>> does the git diff look so clumsy?
>
> Kirill was suggesting to use --patience to generate the diff which leads
> to a slightly better output. It has been posted as a separate email [1].
> Maybe you will find that one easier to review.
>
> [1] http://lkml.kernel.org/r/20171213143948.GM25185@dhcp22.suse.cz

Yeah, it does look better.

[...]
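As an aside, for anyone wanting to reproduce that output locally, patience
is just a diff option, so something along these lines should do (the
revision range below is only a placeholder; git format-patch accepts the
same option):

  git diff --patience <base>..<base-with-patch> -- mm/migrate.c
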
>>> -                        return thp;
>>> -        } else
>>> -                return __alloc_pages_node(pm->node,
>>> -                                GFP_HIGHUSER_MOVABLE | __GFP_THISNODE, 0);
>>> +        err = migrate_pages(pagelist, alloc_new_node_page, NULL, node,
>>> +                        MIGRATE_SYNC, MR_SYSCALL);
>>> +        if (err)
>>> +                putback_movable_pages(pagelist);
>>> +        return err;
>>>  }
>>
>> Even this one. IIUC, do_move_pages_to_node() migrates a chunk of pages
>> at a time, all of which belong to the same target node. Perhaps the name
>> should suggest so. All these page migration helper functions sound so
>> similar.
>
> What do you suggest? I find do_move_pages_to_node quite explicit about
> its purpose.

Sure. Not a big deal.

[...]

>>>  {
>>> +        struct vm_area_struct *vma;
>>> +        struct page *page;
>>> +        unsigned int follflags;
>>>          int err;
>>> -        struct page_to_node *pp;
>>> -        LIST_HEAD(pagelist);
>>>
>>>          down_read(&mm->mmap_sem);
>>
>> Holding mmap_sem for individual pages makes sense. The current
>> implementation is holding it for an entire batch.
>
> I didn't bother to optimize this path to be honest. It is true that lock
> batching can lead to improvements but that would complicate the code
> (how many pages to batch?) so I've left that for later if somebody
> actually sees any problem.
>
>>> +        err = -EFAULT;
>>> +        vma = find_vma(mm, addr);
>>> +        if (!vma || addr < vma->vm_start || !vma_migratable(vma))
>>
>> While here, should we not add 'addr > vma->vm_end' to this condition?
>
> No. See what find_vma does.

Right.

> [...]
>
> Please cut the quoted reply down to a minimum

Sure, will do.

>
>>> @@ -1593,79 +1556,80 @@ static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes,
>>>                           const int __user *nodes,
>>>                           int __user *status, int flags)
>>>  {
>>> -        struct page_to_node *pm;
>>> -        unsigned long chunk_nr_pages;
>>> -        unsigned long chunk_start;
>>> -        int err;
>>> -
>>> -        err = -ENOMEM;
>>> -        pm = (struct page_to_node *)__get_free_page(GFP_KERNEL);
>>> -        if (!pm)
>>> -                goto out;
>>> +        int chunk_node = NUMA_NO_NODE;
>>> +        LIST_HEAD(pagelist);
>>> +        int chunk_start, i;
>>> +        int err = 0, err1;
>>
>> The err init might not be required, it's getting assigned -EFAULT right
>> away.
>
> No, nr_pages might be 0 AFAICS.

Right, but there is another err = 0 after the for loop.

>
> [...]
>>> +                if (chunk_node == NUMA_NO_NODE) {
>>> +                        chunk_node = node;
>>> +                        chunk_start = i;
>>> +                } else if (node != chunk_node) {
>>> +                        err = do_move_pages_to_node(mm, &pagelist, chunk_node);
>>> +                        if (err)
>>> +                                goto out;
>>> +                        err = store_status(status, chunk_start, chunk_node, i - chunk_start);
>>> +                        if (err)
>>> +                                goto out;
>>> +                        chunk_start = i;
>>> +                        chunk_node = node;
>>>                  }
[...]
>>> +                err = do_move_pages_to_node(mm, &pagelist, chunk_node);
>>> +                if (err)
>>> +                        goto out;
>>> +                if (i > chunk_start) {
>>> +                        err = store_status(status, chunk_start, chunk_node, i - chunk_start);
>>> +                        if (err)
>>> +                                goto out;
>>> +                }
>>> +                chunk_node = NUMA_NO_NODE;
>>
>> This block of code is a bit confusing.
>
> I believe this is easier to grasp when looking at the resulting code.
>>
>> 1) Why attempt to migrate when just one page could not be isolated?
>> 2) 'i' is always greater than chunk_start except for the starting page.
>> 3) Why reset chunk_node to NUMA_NO_NODE?
>
> This is all about flushing the pending state on an error and
> distinguishing a fresh batch.

Okay. Will test it out on a multi-node system once I get hold of one.
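
For reference, the chunk/flush pattern being discussed can be condensed into
the stand-alone sketch below. It is only an illustration of the control
flow: flush_chunk() and isolate_page() are invented stand-ins for the
do_move_pages_to_node()/store_status() pair and for the per-page isolation
step, the sample input is made up, and this is not the actual mm/migrate.c
code.

/*
 * Stand-alone illustration of the chunk/flush control flow: pages whose
 * requested target node matches the current chunk are queued, and the
 * pending chunk is flushed whenever the target node changes, isolation of
 * a page fails, or the loop ends.
 */
#include <stdio.h>

#define NUMA_NO_NODE        (-1)

/* Stand-in for do_move_pages_to_node() + store_status(). */
static int flush_chunk(int node, int start, int end)
{
        printf("migrate pages [%d..%d) to node %d\n", start, end, node);
        return 0;
}

/* Stand-in for the isolation step; pretend page 3 cannot be isolated. */
static int isolate_page(int i)
{
        return i == 3 ? -1 : 0;
}

int main(void)
{
        int nodes[] = { 0, 0, 1, 1, 1, 0 };        /* requested target nodes */
        int nr_pages = sizeof(nodes) / sizeof(nodes[0]);
        int chunk_node = NUMA_NO_NODE;
        int chunk_start = 0;
        int i, err = 0;

        for (i = 0; i < nr_pages; i++) {
                int node = nodes[i];

                if (chunk_node == NUMA_NO_NODE) {
                        /* First page of a fresh batch. */
                        chunk_node = node;
                        chunk_start = i;
                } else if (node != chunk_node) {
                        /* Target node changed: flush the pending batch. */
                        err = flush_chunk(chunk_node, chunk_start, i);
                        if (err)
                                goto out;
                        chunk_node = node;
                        chunk_start = i;
                }

                if (!isolate_page(i))
                        continue;        /* queued for the current batch */

                /*
                 * Isolation failed: flush whatever is already pending and
                 * reset chunk_node so the next page starts a fresh batch.
                 * This is the NUMA_NO_NODE reset being asked about above.
                 */
                if (i > chunk_start) {
                        err = flush_chunk(chunk_node, chunk_start, i);
                        if (err)
                                goto out;
                }
                chunk_node = NUMA_NO_NODE;
        }

        /* Flush anything still pending when the loop ends. */
        if (chunk_node != NUMA_NO_NODE)
                err = flush_chunk(chunk_node, chunk_start, i);
out:
        return err ? 1 : 0;
}

Run on the sample input, this flushes a batch each time the requested node
changes and starts a fresh batch after the simulated isolation failure,
which is the "flush the pending state and distinguish a fresh batch"
behaviour explained above.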