From: Anshuman Khandual
Subject: Re: [RFC PATCH 1/3] mm, numa: rework do_pages_move
To: Michal Hocko, Anshuman Khandual
Cc: linux-mm@kvack.org, Zi Yan, Naoya Horiguchi, "Kirill A. Shutemov",
 Vlastimil Babka, Andrew Morton, Andrea Reale, LKML
Date: Wed, 3 Jan 2018 15:06:49 +0530
Message-Id: <32bec0c9-60e2-0362-9446-feb4de1b119c@linux.vnet.ibm.com>
In-Reply-To: <20180103085804.GA11319@dhcp22.suse.cz>
References: <20171207143401.GK20234@dhcp22.suse.cz>
 <20171208161559.27313-1-mhocko@kernel.org>
 <20171208161559.27313-2-mhocko@kernel.org>
 <7dd106bd-460a-73a7-bae8-17ffe66a69ee@linux.vnet.ibm.com>
 <20180103085804.GA11319@dhcp22.suse.cz>

On 01/03/2018 02:28 PM, Michal Hocko wrote:
> On Wed 03-01-18 14:12:17, Anshuman Khandual wrote:
>> On 12/08/2017 09:45 PM, Michal Hocko wrote:
> [...]

[...]

>>
>> This reuses the existing page allocation helper from the migrate_pages()
>> system call. But all these allocator helper names for the migrate_pages()
>> function are really confusing. Even in this case, alloc_new_node_page()
>> and the original new_node_page(), which is still getting used in
>> do_migrate_range(), sound similar even though their implementations are
>> quite different. IMHO either all of them should be moved into a header
>> file with properly differentiated names, or they can stay in their
>> respective files with these generic names and be cleaned up later.
>
> I believe I've tried that but I couldn't move them into a single header
> file easily because of header file dependencies. I agree that their
> names are quite confusing. Feel free to send a patch to clean this up.

Sure. Will try once this one gets into mmotm.

[...]

>>
>> Just a nit. new_page_node() and store_status() seem different. Then why
>> does the git diff look so clumsy?
>
> Kirill was suggesting to use --patience to generate the diff which leads
> to a slightly better output. It has been posted as a separate email [1].
> Maybe you will find that one easier to review.
>
> [1] http://lkml.kernel.org/r/20171213143948.GM25185@dhcp22.suse.cz

Yeah, it does look better.

[...]
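As an aside, for anyone wanting to reproduce that output locally, patience
is just a diff option, so something along these lines should do (the
revision range below is only a placeholder; git format-patch accepts the
same option):

  git diff --patience <base>..<base-with-patch> -- mm/migrate.c
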
>>> -                        return thp;
>>> -        } else
>>> -                return __alloc_pages_node(pm->node,
>>> -                                GFP_HIGHUSER_MOVABLE | __GFP_THISNODE, 0);
>>> +        err = migrate_pages(pagelist, alloc_new_node_page, NULL, node,
>>> +                        MIGRATE_SYNC, MR_SYSCALL);
>>> +        if (err)
>>> +                putback_movable_pages(pagelist);
>>> +        return err;
>>>  }
>>
>> Even this one. IIUC, do_move_pages_to_node() migrates a chunk of pages
>> at a time, all of which belong to the same target node. Perhaps the name
>> should suggest so. All these page migration helper functions sound so
>> similar.
>
> What do you suggest? I find do_move_pages_to_node quite explicit about
> its purpose.

Sure. Not a big deal.

[...]

>>>  {
>>> +        struct vm_area_struct *vma;
>>> +        struct page *page;
>>> +        unsigned int follflags;
>>>          int err;
>>> -        struct page_to_node *pp;
>>> -        LIST_HEAD(pagelist);
>>>
>>>          down_read(&mm->mmap_sem);
>>
>> Holding mmap_sem for individual pages makes sense. The current
>> implementation is holding it for an entire batch.
>
> I didn't bother to optimize this path to be honest. It is true that lock
> batching can lead to improvements but that would complicate the code
> (how many pages to batch?) so I've left that for later if somebody
> actually sees any problem.
>
>>> +        err = -EFAULT;
>>> +        vma = find_vma(mm, addr);
>>> +        if (!vma || addr < vma->vm_start || !vma_migratable(vma))
>>
>> While here, should we not add 'addr > vma->vm_end' to this condition?
>
> No. See what find_vma does.

Right.

> [...]
>
> Please cut the quoted reply down to a minimum

Sure, will do.

>
>>> @@ -1593,79 +1556,80 @@ static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes,
>>>                           const int __user *nodes,
>>>                           int __user *status, int flags)
>>>  {
>>> -        struct page_to_node *pm;
>>> -        unsigned long chunk_nr_pages;
>>> -        unsigned long chunk_start;
>>> -        int err;
>>> -
>>> -        err = -ENOMEM;
>>> -        pm = (struct page_to_node *)__get_free_page(GFP_KERNEL);
>>> -        if (!pm)
>>> -                goto out;
>>> +        int chunk_node = NUMA_NO_NODE;
>>> +        LIST_HEAD(pagelist);
>>> +        int chunk_start, i;
>>> +        int err = 0, err1;
>>
>> The err init might not be required, it's getting assigned -EFAULT right
>> away.
>
> No, nr_pages might be 0 AFAICS.

Right, but there is another err = 0 after the for loop.

>
> [...]
>>> +                if (chunk_node == NUMA_NO_NODE) {
>>> +                        chunk_node = node;
>>> +                        chunk_start = i;
>>> +                } else if (node != chunk_node) {
>>> +                        err = do_move_pages_to_node(mm, &pagelist, chunk_node);
>>> +                        if (err)
>>> +                                goto out;
>>> +                        err = store_status(status, chunk_start, chunk_node, i - chunk_start);
>>> +                        if (err)
>>> +                                goto out;
>>> +                        chunk_start = i;
>>> +                        chunk_node = node;
>>>                  }
[...]
>>> +                err = do_move_pages_to_node(mm, &pagelist, chunk_node);
>>> +                if (err)
>>> +                        goto out;
>>> +                if (i > chunk_start) {
>>> +                        err = store_status(status, chunk_start, chunk_node, i - chunk_start);
>>> +                        if (err)
>>> +                                goto out;
>>> +                }
>>> +                chunk_node = NUMA_NO_NODE;
>>
>> This block of code is a bit confusing.
>
> I believe this is easier to grasp when looking at the resulting code.
>>
>> 1) Why attempt to migrate when just one page could not be isolated?
>> 2) 'i' is always greater than chunk_start except for the starting page.
>> 3) Why reset chunk_node to NUMA_NO_NODE?
>
> This is all about flushing the pending state on an error and
> distinguishing a fresh batch.

Okay. Will test it out on a multi-node system once I get hold of one.
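
For reference, the chunk/flush pattern being discussed can be condensed into
the stand-alone sketch below. It is only an illustration of the control
flow: flush_chunk() and isolate_page() are invented stand-ins for the
do_move_pages_to_node()/store_status() pair and for the per-page isolation
step, the sample input is made up, and this is not the actual mm/migrate.c
code.

/*
 * Stand-alone illustration of the chunk/flush control flow: pages whose
 * requested target node matches the current chunk are queued, and the
 * pending chunk is flushed whenever the target node changes, isolation of
 * a page fails, or the loop ends.
 */
#include <stdio.h>

#define NUMA_NO_NODE        (-1)

/* Stand-in for do_move_pages_to_node() + store_status(). */
static int flush_chunk(int node, int start, int end)
{
        printf("migrate pages [%d..%d) to node %d\n", start, end, node);
        return 0;
}

/* Stand-in for the isolation step; pretend page 3 cannot be isolated. */
static int isolate_page(int i)
{
        return i == 3 ? -1 : 0;
}

int main(void)
{
        int nodes[] = { 0, 0, 1, 1, 1, 0 };        /* requested target nodes */
        int nr_pages = sizeof(nodes) / sizeof(nodes[0]);
        int chunk_node = NUMA_NO_NODE;
        int chunk_start = 0;
        int i, err = 0;

        for (i = 0; i < nr_pages; i++) {
                int node = nodes[i];

                if (chunk_node == NUMA_NO_NODE) {
                        /* First page of a fresh batch. */
                        chunk_node = node;
                        chunk_start = i;
                } else if (node != chunk_node) {
                        /* Target node changed: flush the pending batch. */
                        err = flush_chunk(chunk_node, chunk_start, i);
                        if (err)
                                goto out;
                        chunk_node = node;
                        chunk_start = i;
                }

                if (!isolate_page(i))
                        continue;        /* queued for the current batch */

                /*
                 * Isolation failed: flush whatever is already pending and
                 * reset chunk_node so the next page starts a fresh batch.
                 * This is the NUMA_NO_NODE reset being asked about above.
                 */
                if (i > chunk_start) {
                        err = flush_chunk(chunk_node, chunk_start, i);
                        if (err)
                                goto out;
                }
                chunk_node = NUMA_NO_NODE;
        }

        /* Flush anything still pending when the loop ends. */
        if (chunk_node != NUMA_NO_NODE)
                err = flush_chunk(chunk_node, chunk_start, i);
out:
        return err ? 1 : 0;
}

Run on the sample input, this flushes a batch each time the requested node
changes and starts a fresh batch after the simulated isolation failure,
which is the "flush the pending state and distinguish a fresh batch"
behaviour explained above.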