On 05.04.2019 5:12, Hugh Dickins wrote: > On Tue, 2 Apr 2019, Hugh Dickins wrote: >> On Sun, 31 Mar 2019, Hugh Dickins wrote: >>> On Sun, 31 Mar 2019, Alex Xu (Hello71) wrote: >>>> Excerpts from Vineeth Pillai's message of March 25, 2019 6:08 pm: >>>>> On Sun, Mar 24, 2019 at 11:30 AM Alex Xu (Hello71) wrote: >>>>>> >>>>>> I get this BUG in 5.1-rc1 sometimes when powering off the machine. I >>>>>> suspect my setup erroneously executes two swapoff+cryptsetup close >>>>>> operations simultaneously, so a race condition is triggered. >>>>>> >>>>>> I am using a single swap on a plain dm-crypt device on a MBR partition >>>>>> on a SATA drive. >>>>>> >>>>>> I think the problem is probably related to >>>>>> b56a2d8af9147a4efe4011b60d93779c0461ca97, so CCing the related people. >>>>>> >>>>> Could you please provide more information on this - stack trace, dmesg etc? >>>>> Is it easily reproducible? If yes, please detail the steps so that I >>>>> can try it inhouse. >>>>> >>>>> Thanks, >>>>> Vineeth >>>>> >>>> >>>> Some info from the BUG entry (I didn't bother to type it all, >>>> low-quality image available upon request): >>>> >>>> BUG: unable to handle kernel NULL pointer dereference at 0000000000000000 >>>> #PF error: [normal kernel read fault] >>>> PGD 0 P4D 0 >>>> Oops: 0000 [#1] SMP >>>> CPU: 0 Comm: swapoff Not tainted 5.1.0-rc1+ #2 >>>> RIP: 0010:shmem_recalc_inode+0x41/0x90 >>>> >>>> Call Trace: >>>> ? shmem_undo_range >>>> ? rb_erase_cached >>>> ? set_next_entity >>>> ? __inode_wait_for_writeback >>>> ? shmem_truncate_range >>>> ? shmem_evict_inode >>>> ? evict >>>> ? shmem_unuse >>>> ? try_to_unuse >>>> ? swapcache_free_entries >>>> ? _cond_resched >>>> ? __se_sys_swapoff >>>> ? do_syscall_64 >>>> ? entry_SYSCALL_64_after_hwframe >>>> >>>> As I said, it only occurs occasionally on shutdown. I think it is a safe >>>> guess that it can only occur when the swap is not empty, but possibly >>>> other conditions are necessary, so I will test further. >>> >>> Thanks for the update, Alex. I'm looking into a couple of bugs with the >>> 5.1-rc swapoff, but this one doesn't look like anything I know so far. >>> shmem_recalc_inode() is a surprising place to crash: it's as if the >>> igrab() in shmem_unuse() were not working. >>> >>> Yes, please do send Vineeth and me (or the lists) your low-quality image, >>> in case we can extract any more info from it; and also please the >>> disassembly of your kernel's shmem_recalc_inode(), so we can be sure of >>> exactly what it's crashing on (though I expect that will leave me as >>> puzzled as before). >>> >>> If you want to experiment with one of my fixes, not yet written up and >>> posted, just try changing SWAP_UNUSE_MAX_TRIES in mm/swapfile.c from >>> 3 to INT_MAX: I don't see how that issue could manifest as crashing in >>> shmem_recalc_inode(), but I may just be too stupid to see it. >> >> Thanks for the image and disassembly you sent: which showed that the >> ffffffff81117351: 48 83 3f 00 cmpq $0x0,(%rdi) >> you are crashing on, is the "if (sbinfo->max_blocks)" in the inlined >> shmem_inode_unacct_blocks(): inode->i_sb->s_fs_info is NULL, which is >> something that shmem_put_super() does. >> >> Eight-year-old memories stirred: I knew when looking at Vineeth's patch, >> that I ought to look back through the history of mm/shmem.c, to check >> some points that Konstantin Khlebnikov had made years ago, that >> surprised me then and were in danger of surprising us again with this >> rework. But I failed to do so: thank you Alex, for reporting this bug >> and pointing us back there. >> >> igrab() protects from eviction but does not protect from unmounting. >> I bet that is what you are hitting, though I've not even read through >> 2.6.39's 778dd893ae785 ("tmpfs: fix race between umount and swapoff") >> again yet, and not begun to think of the fix for it this time around; >> but wanted to let you know that this bug is now (probably) identified. > > Hi Alex, could you please give the patch below a try? It fixes a > problem, but I'm not sure that it's your problem - please let us know. > > I've not yet written up the commit description, and this should end up > as 4/4 in a series fixing several new swapoff issues: I'll wait to post > the finished series until heard back from you. > > I did first try following the suggestion Konstantin had made back then, > for a similar shmem_writepage() case: atomic_inc_not_zero(&sb->s_active). > > But it turned out to be difficult to get right in shmem_unuse(), because > of the way that relies on the inode as a cursor in the list - problem > when you've acquired an s_active reference, but fail to acquire inode > reference, and cannot safely release the s_active reference while still > holding the swaplist mutex. > > If VFS offered an isgrab(inode), like igrab() but acquiring s_active > reference while holding i_lock, that would drop very easily into the > current shmem_unuse() as a replacement there for igrab(). But the rest > of the world has managed without that for years, so I'm disinclined to > add it just for this. And the patch below seems good enough without it. > > Thanks, > Hugh > > --- > > include/linux/shmem_fs.h | 1 + > mm/shmem.c | 39 ++++++++++++++++++--------------------- > 2 files changed, 19 insertions(+), 21 deletions(-) > > --- 5.1-rc3/include/linux/shmem_fs.h 2019-03-17 16:18:15.181820820 -0700 > +++ linux/include/linux/shmem_fs.h 2019-04-04 16:18:08.193512968 -0700 > @@ -21,6 +21,7 @@ struct shmem_inode_info { > struct list_head swaplist; /* chain of maybes on swap */ > struct shared_policy policy; /* NUMA memory alloc policy */ > struct simple_xattrs xattrs; /* list of xattrs */ > + atomic_t stop_eviction; /* hold when working on inode */ > struct inode vfs_inode; > }; > > --- 5.1-rc3/mm/shmem.c 2019-03-17 16:18:15.701823872 -0700 > +++ linux/mm/shmem.c 2019-04-04 16:18:08.193512968 -0700 > @@ -1081,9 +1081,15 @@ static void shmem_evict_inode(struct ino > } > spin_unlock(&sbinfo->shrinklist_lock); > } > - if (!list_empty(&info->swaplist)) { > + while (!list_empty(&info->swaplist)) { > + /* Wait while shmem_unuse() is scanning this inode... */ > + wait_var_event(&info->stop_eviction, > + !atomic_read(&info->stop_eviction)); > mutex_lock(&shmem_swaplist_mutex); > list_del_init(&info->swaplist); > + /* ...but beware of the race if we peeked too early */ > + if (!atomic_read(&info->stop_eviction)) > + list_del_init(&info->swaplist); > mutex_unlock(&shmem_swaplist_mutex); > } > } > @@ -1227,36 +1233,27 @@ int shmem_unuse(unsigned int type, bool > unsigned long *fs_pages_to_unuse) > { > struct shmem_inode_info *info, *next; > - struct inode *inode; > - struct inode *prev_inode = NULL; > int error = 0; > > if (list_empty(&shmem_swaplist)) > return 0; > > mutex_lock(&shmem_swaplist_mutex); > - > - /* > - * The extra refcount on the inode is necessary to safely dereference > - * p->next after re-acquiring the lock. New shmem inodes with swap > - * get added to the end of the list and we will scan them all. > - */ > list_for_each_entry_safe(info, next, &shmem_swaplist, swaplist) { > if (!info->swapped) { > list_del_init(&info->swaplist); > continue; > } > - > - inode = igrab(&info->vfs_inode); > - if (!inode) > - continue; > - > + /* > + * Drop the swaplist mutex while searching the inode for swap; > + * but before doing so, make sure shmem_evict_inode() will not > + * remove placeholder inode from swaplist, nor let it be freed > + * (igrab() would protect from unlink, but not from unmount). > + */ > + atomic_inc(&info->stop_eviction); > mutex_unlock(&shmem_swaplist_mutex); > - if (prev_inode) > - iput(prev_inode); > - prev_inode = inode; This seems too ad hoc solution. Superblock could be protected with s_umount, in same way as writeback pins it in __writeback_inodes_wb() Please see (completely untested) patch in attachment. > > - error = shmem_unuse_inode(inode, type, frontswap, > + error = shmem_unuse_inode(&info->vfs_inode, type, frontswap, > fs_pages_to_unuse); > cond_resched(); > > @@ -1264,14 +1261,13 @@ int shmem_unuse(unsigned int type, bool > next = list_next_entry(info, swaplist); > if (!info->swapped) > list_del_init(&info->swaplist); > + if (atomic_dec_and_test(&info->stop_eviction)) > + wake_up_var(&info->stop_eviction); > if (error) > break; > } > mutex_unlock(&shmem_swaplist_mutex); > > - if (prev_inode) > - iput(prev_inode); > - > return error; > } > > @@ -2238,6 +2234,7 @@ static struct inode *shmem_get_inode(str > info = SHMEM_I(inode); > memset(info, 0, (char *)inode - (char *)info); > spin_lock_init(&info->lock); > + atomic_set(&info->stop_eviction, 0); > info->seals = F_SEAL_SEAL; > info->flags = flags & VM_NORESERVE; > INIT_LIST_HEAD(&info->shrinklist); >