Message-ID: <1488916356.6405.4.camel@redhat.com>
Subject: Re: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever
From: Rik van Riel
To: Michal Hocko, Andrew Morton
Cc: Mel Gorman, Johannes Weiner, Vlastimil Babka, Tetsuo Handa,
    linux-mm@kvack.org, LKML, Michal Hocko
Date: Tue, 07 Mar 2017 14:52:36 -0500
In-Reply-To: <20170307133057.26182-1-mhocko@kernel.org>
References: <20170307133057.26182-1-mhocko@kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, 2017-03-07 at 14:30 +0100, Michal Hocko wrote:
> From: Michal Hocko
>
> Tetsuo Handa has reported [1][2] that direct reclaimers might get stuck
> in too_many_isolated loop basically for ever because the last few pages
> on the LRU lists are isolated by the kswapd which is stuck on fs locks
> when doing the pageout or slab reclaim. This in turn means that there is
> nobody to actually trigger the oom killer and the system is basically
> unusable.
>
> too_many_isolated has been introduced by 35cd78156c49 ("vmscan: throttle
> direct reclaim when too many pages are isolated already") to prevent
> from pre-mature oom killer invocations because back then no reclaim
> progress could indeed trigger the OOM killer too early. But since the
> oom detection rework 0a0337e0d1d1 ("mm, oom: rework oom detection")
> the allocation/reclaim retry loop considers all the reclaimable pages
> and throttles the allocation at that layer so we can loosen the direct
> reclaim throttling.

It only does this to some extent.  If reclaim made
no progress, for example due to immediately bailing
out because the number of already isolated pages is
too high (due to many parallel reclaimers), the code
could hit the "no_progress_loops > MAX_RECLAIM_RETRIES"
test without ever looking at the number of reclaimable
pages.

Could that create problems if we have many concurrent
reclaimers?

It may be OK, I just do not understand all the
implications.

I like the general direction your patch takes the code
in, but I would like to understand it better...

-- 
All rights reversed
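
To make the concern above concrete, here is a rough user-space sketch of
the ordering being asked about. It is not the kernel code: the toy_*
functions and their parameters are hypothetical, and the value 16 for
MAX_RECLAIM_RETRIES is assumed for illustration only. The one thing it
tries to show is that a reclaimer which keeps bailing out with no
progress runs into the retry cap regardless of how many pages are
reclaimable.

```c
#include <stdbool.h>

/* Assumed value for illustration; the mail only names the constant. */
#define MAX_RECLAIM_RETRIES 16

/*
 * Hypothetical stand-in for the allocator's retry decision.
 * Returns true if the caller should try reclaim again.
 */
static bool toy_should_retry(bool did_some_progress, int *no_progress_loops,
			     unsigned long reclaimable,
			     unsigned long free_pages,
			     unsigned long watermark)
{
	if (did_some_progress)
		*no_progress_loops = 0;
	else
		(*no_progress_loops)++;

	/* The cap is tested first; reclaimable pages are not consulted. */
	if (*no_progress_loops > MAX_RECLAIM_RETRIES)
		return false;

	/* Only reached while still under the cap. */
	return reclaimable + free_pages > watermark;
}

/*
 * Hypothetical reclaimer that never makes progress, as if it bailed out
 * every time because too many pages were already isolated.
 * Returns how many reclaim attempts were made before giving up.
 */
static int toy_count_attempts(unsigned long reclaimable,
			      unsigned long free_pages,
			      unsigned long watermark)
{
	int no_progress_loops = 0;
	int attempts = 0;

	do {
		attempts++;
	} while (toy_should_retry(false, &no_progress_loops,
				  reclaimable, free_pages, watermark));

	return attempts;
}
```

In this model, toy_count_attempts(1000000, 0, 100) gives up after 17
attempts even though a million pages are nominally reclaimable, while
toy_count_attempts(0, 0, 100) gives up after the first attempt. Whether
the real retry loop behaves like this under many concurrent reclaimers
is exactly the open question.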