From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Thu, 13 Oct 2016 14:46:51 -0700
From: Andrei Vagin
To: "Eric W. Biederman"
Cc: Andrei Vagin, Alexander Viro
Subject: Re: [RFC][PATCH] mount: In mark_umount_candidates and __propogate_umount visit each mount once
Message-ID: <20161013214650.GB19836@outlook.office365.com>
References: <1476141965-21429-1-git-send-email-avagin@openvz.org>
 <877f9c6ui8.fsf@x220.int.ebiederm.org>
 <87pon458l1.fsf_-_@x220.int.ebiederm.org>
MIME-Version: 1.0
Content-Type: text/plain; charset="koi8-r"
Content-Disposition: inline
In-Reply-To: <87pon458l1.fsf_-_@x220.int.ebiederm.org>
User-Agent: Mutt/1.7.0 (2016-08-17)
Sender: linux-kernel-owner@vger.kernel.org
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu, Oct 13, 2016 at 02:53:46PM -0500, Eric W. Biederman wrote:
> 
> Andrei Vagin pointed out that the time to execute propagate_umount can go
> non-linear (and take a ludicrous amount of time) when the mount
> propagation trees of the mounts to be unmounted by a lazy unmount
> overlap.
> 
> Solve this in the most straightforward way possible, by adding a new
> mount flag to mark parts of the mount propagation tree that have been
> visited, and use that mark to skip parts of the mount propagation tree
> that have already been visited during an unmount.  This guarantees
> that each mountpoint in the possibly overlapping mount propagation
> trees will be visited exactly once.
> 
> Add the functions propagation_visit_next and propagation_revisit_next
> to coordinate setting and clearing the visited mount mark.
> 
> Here is a script to generate such a mount tree:
> $ cat run.sh
> mount -t tmpfs test-mount /mnt
> mount --make-shared /mnt
> for i in `seq $1`; do
> 	mkdir /mnt/test.$i
> 	mount --bind /mnt /mnt/test.$i
> done
> cat /proc/mounts | grep test-mount | wc -l
> time umount -l /mnt
> $ for i in `seq 10 16`; do echo $i; unshare -Urm bash ./run.sh $i; done
> 
> Here are the performance numbers with and without the patch:
> 
>  mounts | before (real sec) | after (real sec)
> ----------------------------------------------
>    1024 |  0.071            | 0.024
>    2048 |  0.184            | 0.030
>    4096 |  0.604            | 0.040
>    8192 |  4.471            | 0.043
>   16384 | 34.826            | 0.082
>   32768 |                   | 0.151
>   65536 |                   | 0.289
>  131072 |                   | 0.659
> 
> For Andrei Vagin, fixing this performance problem is part of the
> work to fix CVE-2016-6213.
> 
> Cc: stable@vger.kernel.org
> Reported-by: Andrei Vagin
> Signed-off-by: "Eric W. Biederman"
> ---
> 
> Andrei, can you take a look at this patch and see if you can see any
> problems?  My limited testing suggests this approach does a much better
> job of solving the problem you were seeing, with the time looking
> almost linear in the number of mounts now.

I read this patch and I like the idea.
Then I ran my tests, and one of them doesn't work with this patch. I
haven't found the reason yet. Here is the test:

[root@fc24 mounts]# cat run.sh
set -e
mount -t tmpfs zdtm /mnt
mkdir -p /mnt/1 /mnt/2
mount -t tmpfs zdtm /mnt/1
mount --make-shared /mnt/1
for i in `seq $1`; do
	mount --bind /mnt/1 `mktemp -d /mnt/1/test.XXXXXX`
done
mount --rbind /mnt/1 /mnt/2
cat /proc/self/mountinfo | grep zdtm | wc -l
time umount -l /mnt/1
cat /proc/self/mountinfo | grep zdtm | wc -l
umount /mnt/2

[root@fc24 mounts]# unshare -Urm ./run.sh 5
65

real	0m0.014s
user	0m0.000s
sys	0m0.004s
33
umount: /mnt/2: target is busy
        (In some cases useful info about processes that use
         the device is found by lsof(8) or fuser(1).)

> 
>  fs/pnode.c            | 125 ++++++++++++++++++++++++++++++++++++++++++++++--
>  fs/pnode.h            |   4 ++
>  include/linux/mount.h |   2 +
>  3 files changed, 126 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/pnode.c b/fs/pnode.c
> index 234a9ac49958..3acce0c75f94 100644
> --- a/fs/pnode.c
> +++ b/fs/pnode.c
> @@ -164,6 +164,120 @@ static struct mount *propagation_next(struct mount *m,
>  	}
>  }
>  
> +/*
> + * get the next mount in the propagation tree (that has not been visited)
> + * @m: the mount seen last
> + * @origin: the original mount from where the tree walk initiated
> + *
> + * Note that peer groups form contiguous segments of slave lists.
> + * We rely on that in get_source() to be able to find out if
> + * vfsmount found while iterating with propagation_next() is
> + * a peer of one we'd found earlier.
> + */
> +static struct mount *propagation_visit_next(struct mount *m,
> +					    struct mount *origin)
> +{
> +	/* Has this part of the propagation tree already been visited? */
> +	if (IS_MNT_VISITED(m))
> +		return NULL;
> +
> +	SET_MNT_VISITED(m);
> +
> +	/* are there any slaves of this mount? */
> +	if (!list_empty(&m->mnt_slave_list)) {
> +		struct mount *slave = first_slave(m);
> +		while (1) {
> +			if (!IS_MNT_VISITED(slave))
> +				return slave;
> +			if (slave->mnt_slave.next == &m->mnt_slave_list)
> +				break;
> +			slave = next_slave(slave);
> +		}
> +	}
> +	while (1) {
> +		struct mount *master = m->mnt_master;
> +
> +		if (master == origin->mnt_master) {
> +			struct mount *next = next_peer(m);
> +			while (1) {
> +				if (next == origin)
> +					return NULL;
> +				if (!IS_MNT_VISITED(next))
> +					return next;
> +				next = next_peer(next);
> +			}
> +		} else {
> +			while (1) {
> +				if (m->mnt_slave.next == &master->mnt_slave_list)
> +					break;
> +				m = next_slave(m);
> +				if (!IS_MNT_VISITED(m))
> +					return m;
> +			}
> +		}
> +
> +		/* back at master */
> +		m = master;
> +	}
> +}
> +
> +/*
> + * get the next mount in the propagation tree (that has not been revisited)
> + * @m: the mount seen last
> + * @origin: the original mount from where the tree walk initiated
> + *
> + * Note that peer groups form contiguous segments of slave lists.
> + * We rely on that in get_source() to be able to find out if
> + * vfsmount found while iterating with propagation_next() is
> + * a peer of one we'd found earlier.
> + */
> +static struct mount *propagation_revisit_next(struct mount *m,
> +					      struct mount *origin)
> +{
> +	/* Has this part of the propagation tree already been revisited? */
> +	if (!IS_MNT_VISITED(m))
> +		return NULL;
> +
> +	CLEAR_MNT_VISITED(m);
> +
> +	/* are there any slaves of this mount? */
> +	if (!list_empty(&m->mnt_slave_list)) {
> +		struct mount *slave = first_slave(m);
> +		while (1) {
> +			if (IS_MNT_VISITED(slave))
> +				return slave;
> +			if (slave->mnt_slave.next == &m->mnt_slave_list)
> +				break;
> +			slave = next_slave(slave);
> +		}
> +	}
> +	while (1) {
> +		struct mount *master = m->mnt_master;
> +
> +		if (master == origin->mnt_master) {
> +			struct mount *next = next_peer(m);
> +			while (1) {
> +				if (next == origin)
> +					return NULL;
> +				if (IS_MNT_VISITED(next))
> +					return next;
> +				next = next_peer(next);
> +			}
> +		} else {
> +			while (1) {
> +				if (m->mnt_slave.next == &master->mnt_slave_list)
> +					break;
> +				m = next_slave(m);
> +				if (IS_MNT_VISITED(m))
> +					return m;
> +			}
> +		}
> +
> +		/* back at master */
> +		m = master;
> +	}
> +}
> +
>  static struct mount *next_group(struct mount *m, struct mount *origin)
>  {
>  	while (1) {
> @@ -399,11 +513,12 @@ static void mark_umount_candidates(struct mount *mnt)
> 
>  	BUG_ON(parent == mnt);
> 
> -	for (m = propagation_next(parent, parent); m;
> -	     m = propagation_next(m, parent)) {
> +	for (m = propagation_visit_next(parent, parent); m;
> +	     m = propagation_visit_next(m, parent)) {
>  		struct mount *child = __lookup_mnt_last(&m->mnt,
>  						mnt->mnt_mountpoint);
> -		if (child && (!IS_MNT_LOCKED(child) || IS_MNT_MARKED(m))) {
> +		if (child && (!IS_MNT_LOCKED(child) ||
> +			      IS_MNT_MARKED(m))) {
>  			SET_MNT_MARK(child);
>  		}
>  	}
> @@ -420,8 +535,8 @@ static void __propagate_umount(struct mount *mnt)
> 
>  	BUG_ON(parent == mnt);
> 
> -	for (m = propagation_next(parent, parent); m;
> -	     m = propagation_next(m, parent)) {
> +	for (m = propagation_revisit_next(parent, parent); m;
> +	     m = propagation_revisit_next(m, parent)) {
> 
>  		struct mount *child = __lookup_mnt_last(&m->mnt,
>  						mnt->mnt_mountpoint);
> diff --git a/fs/pnode.h b/fs/pnode.h
> index 550f5a8b4fcf..988ea4945764 100644
> --- a/fs/pnode.h
> +++ b/fs/pnode.h
> @@ -21,6 +21,10 @@
>  #define CLEAR_MNT_MARK(m) ((m)->mnt.mnt_flags &= ~MNT_MARKED)
>  #define IS_MNT_LOCKED(m) ((m)->mnt.mnt_flags & MNT_LOCKED)
> 
> +#define IS_MNT_VISITED(m) ((m)->mnt.mnt_flags & MNT_VISITED)
> +#define SET_MNT_VISITED(m) ((m)->mnt.mnt_flags |= MNT_VISITED)
> +#define CLEAR_MNT_VISITED(m) ((m)->mnt.mnt_flags &= ~MNT_VISITED)
> +
>  #define CL_EXPIRE		0x01
>  #define CL_SLAVE		0x02
>  #define CL_COPY_UNBINDABLE	0x04
> diff --git a/include/linux/mount.h b/include/linux/mount.h
> index 9227b190fdf2..6048045b96c3 100644
> --- a/include/linux/mount.h
> +++ b/include/linux/mount.h
> @@ -52,6 +52,8 @@ struct mnt_namespace;
> 
>  #define MNT_INTERNAL	0x4000
> 
> +#define MNT_VISITED	0x010000
> +
>  #define MNT_LOCK_ATIME	0x040000
>  #define MNT_LOCK_NOEXEC	0x080000
>  #define MNT_LOCK_NOSUID	0x100000
> -- 
> 2.8.3
> 