Date: Tue, 21 Jun 2016 13:16:51 +0300
From: Vladimir Davydov
To: Johannes Weiner
CC: Tejun Heo, Andrew Morton, Michal Hocko, Li Zefan
Subject: Re: [PATCH 3/3] mm: memcontrol: fix cgroup creation failure after many small jobs
Message-ID: <20160621101650.GD15970@esperanza>
References: <20160616034244.14839-1-hannes@cmpxchg.org> <20160616200617.GD3262@mtj.duckdns.org> <20160617162310.GA19084@cmpxchg.org> <20160617162516.GD19084@cmpxchg.org>
In-Reply-To: <20160617162516.GD19084@cmpxchg.org>

On Fri, Jun 17, 2016 at 12:25:16PM -0400, Johannes Weiner wrote:
> The memory controller has quite a bit of state that usually outlives
> the cgroup and pins its CSS until said state disappears. At the same
> time it imposes a 16-bit limit on the CSS ID space to economically
> store IDs in the wild.
> Consequently, when we use cgroups to contain
> frequent but small and short-lived jobs that leave behind some page
> cache, we quickly run into the 64k limitations of outstanding CSSs.
> Creating a new cgroup fails with -ENOSPC while there are only a few,
> or even no user-visible cgroups in existence.
>
> Although pinning CSSs past cgroup removal is common, there are only
> two instances that actually need an ID after a cgroup is deleted:
> cache shadow entries and swapout records.
>
> Cache shadow entries reference the ID weakly and can deal with the CSS
> having disappeared when it's looked up later. They pose no hurdle.
>
> Swap-out records do need to pin the css to hierarchically attribute
> swapins after the cgroup has been deleted; though the only pages that
> remain swapped out after offlining are tmpfs/shmem pages. And those
> references are under the user's control, so they are manageable.
>
> This patch introduces a private 16-bit memcg ID and switches swap and
> cache shadow entries over to using that. This ID can then be recycled
> after offlining when the CSS remains pinned only by objects that don't
> specifically need it.
>
> This script demonstrates the problem by faulting one cache page in a
> new cgroup and deleting it again:
>
>   set -e
>   mkdir -p pages
>   for x in `seq 128000`; do
>     [ $((x % 1000)) -eq 0 ] && echo $x
>     mkdir /cgroup/foo
>     echo $$ >/cgroup/foo/cgroup.procs
>     echo trex >pages/$x
>     echo $$ >/cgroup/cgroup.procs
>     rmdir /cgroup/foo
>   done
>
> When run on an unpatched kernel, we eventually run out of possible IDs
> even though there are no visible cgroups:
>
>   [root@ham ~]# ./cssidstress.sh
>   [...]
>   65000
>   mkdir: cannot create directory '/cgroup/foo': No space left on device
>
> After this patch, the IDs get released upon cgroup destruction and the
> cache and css objects get released once memory reclaim kicks in.

With 65K cgroups it will take the reclaimer a substantial amount of time
to iterate over all of them, which might result in latency spikes. To
avoid that, we could probably move pages from a dead cgroup's LRU to its
parent's on offline while still leaving the dead cgroups pinned, as we do
for list_lru entries.

>
> Signed-off-by: Johannes Weiner

Reviewed-by: Vladimir Davydov

One nit below.

...
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 75e74408cc8f..dc92b2df2585 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -4057,6 +4057,60 @@ static struct cftype mem_cgroup_legacy_files[] = {
>  	{ }, /* terminate */
>  };
>
> +/*
> + * Private memory cgroup IDR
> + *
> + * Swap-out records and page cache shadow entries need to store memcg
> + * references in constrained space, so we maintain an ID space that is
> + * limited to 16 bit (MEM_CGROUP_ID_MAX), limiting the total number of
> + * memory-controlled cgroups to 64k.
> + *
> + * However, there usually are many references to the offline CSS after
> + * the cgroup has been destroyed, such as page cache or reclaimable
> + * slab objects, that don't need to hang on to the ID. We want to keep
> + * those dead CSS from occupying IDs, or we might quickly exhaust the
> + * relatively small ID space and prevent the creation of new cgroups
> + * even when there are much fewer than 64k cgroups - possibly none.
> + *
> + * Maintain a private 16-bit ID space for memcg, and allow the ID to
> + * be freed and recycled when it's no longer needed, which is usually
> + * when the CSS is offlined.
> + *
> + * The only exception to that are records of swapped out tmpfs/shmem
> + * pages that need to be attributed to live ancestors on swapin. But
> + * those references are manageable from userspace.
> + */
> +
> +static struct idr mem_cgroup_idr;

	static DEFINE_IDR(mem_cgroup_idr);
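
For illustration only, the ID exhaustion described in the changelog can be
reproduced with a toy user-space model. This is a sketch, not kernel code:
it does not use the IDR API, the names toy_id_alloc(), toy_id_free() and
toy_simulate() are made up here, and it ignores details such as the ID
held by the root cgroup. Each "job" allocates an ID (mkdir) and then goes
offline (rmdir) while its object is assumed to stay pinned by page cache;
the only difference between the two modes is whether the ID is given back
at offline time.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Toy model (assumption): usable IDs are 1..TOY_ID_MAX, 0 is reserved. */
#define TOY_ID_MAX 65535

static uint8_t toy_id_used[TOY_ID_MAX + 1];
static int toy_hint = 1;	/* invariant: every ID below toy_hint is in use */

/* Returns the lowest free ID, or -1 (standing in for -ENOSPC) if none is left. */
static int toy_id_alloc(void)
{
	for (int id = toy_hint; id <= TOY_ID_MAX; id++) {
		if (!toy_id_used[id]) {
			toy_id_used[id] = 1;
			toy_hint = id + 1;
			return id;
		}
	}
	return -1;
}

/* Gives an ID back so that a later allocation can reuse it. */
static void toy_id_free(int id)
{
	assert(id >= 1 && id <= TOY_ID_MAX && toy_id_used[id]);
	toy_id_used[id] = 0;
	if (id < toy_hint)
		toy_hint = id;
}

/*
 * Runs `jobs` create/destroy cycles. Returns the 1-based number of the
 * job whose creation fails, or 0 if every job succeeds.
 */
static int toy_simulate(int jobs, bool recycle_on_offline)
{
	memset(toy_id_used, 0, sizeof(toy_id_used));
	toy_hint = 1;

	for (int job = 1; job <= jobs; job++) {
		int id = toy_id_alloc();	/* mkdir /cgroup/foo */

		if (id < 0)
			return job;

		/*
		 * rmdir /cgroup/foo: the cgroup goes offline but is assumed
		 * to stay pinned by the cache page the job left behind.
		 */
		if (recycle_on_offline)
			toy_id_free(id);
	}
	return 0;
}
```

In this model toy_simulate(128000, false) fails at job 65536, once all
65535 IDs are held by dead but pinned objects, while
toy_simulate(128000, true) completes because each ID is reused as soon as
its cgroup goes offline.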