On Tue, 2022-02-15 at 13:51 +0100, Oscar Salvador wrote:
> On Sat, Feb 12, 2022 at 09:37:40PM -0500, Rik van Riel wrote:
> > Sometimes the page offlining code can leave behind a hwpoisoned
> > clean page cache page. This can lead to programs being killed over
> > and over and over again as they fault in the hwpoisoned page, get
> > killed, and then get re-spawned by whatever wanted to run them.
>
> Hi Rik,
>
> Do you know how that exactly happens? We should not be really leaving
> anything behind, and soft-offline (not hard) code works with the
> premise of only poisoning a page in case it was contained, so I am
> wondering what is going on here.
>
> In-use pagecache pages are migrated away, and the actual page is
> contained, and for clean ones, we already do the
> invalidate_inode_page() and then contain it in case we succeed.

I do not know the exact failure case, since I have never
caught a system in the act of leaking one of these pages.

I just know I have seen this issue on systems where the
"soft_offline: %#lx: invalidated\n" printk was the only offline
method leaving any message in the kernel log.

However, there are a few code paths through the soft offlining
code path that don't seem to have any printks, so I am not
sure exactly where things went wrong.

I only really found the aftermath, and tested this patch by
loading it as a kernel live patch module on some of those
systems.

-- 
All Rights Reversed.