Re: [PATCH v2 2/5] 9pfs: fix qemu_mknodat(S_IFSOCK) on macOS

From: Christian Schoenebeck <qemu_oss@crudebyte.com>
To: qemu-devel@nongnu.org
Cc: Keno Fischer <keno@juliacomputing.com>,
	Michael Roitzsch <reactorcontrol@icloud.com>,
	Will Cohen <wwcohen@gmail.com>,  Greg Kurz <groug@kaod.org>,
	qemu-stable@nongnu.org, Akihiko Odaki <akihiko.odaki@gmail.com>
Subject: Re: [PATCH v2 2/5] 9pfs: fix qemu_mknodat(S_IFSOCK) on macOS
Date: Sun, 24 Apr 2022 20:45:21 +0200	[thread overview]
Message-ID: <3849551.ofAv5PygDX@silver> (raw)
In-Reply-To: <eafd4bbf-dbff-323a-179f-8f29905701e1@gmail.com>

On Samstag, 23. April 2022 06:33:50 CEST Akihiko Odaki wrote:
> On 2022/04/22 23:06, Christian Schoenebeck wrote:
> > On Freitag, 22. April 2022 04:43:40 CEST Akihiko Odaki wrote:
> >> On 2022/04/22 0:07, Christian Schoenebeck wrote:
> >>> mknod() on macOS does not support creating sockets, so divert to
> >>> call sequence socket(), bind() and chmod() respectively if S_IFSOCK
> >>> was passed with mode argument.
> >>> 
> >>> Link: https://lore.kernel.org/qemu-devel/17933734.zYzKuhC07K@silver/
> >>> Signed-off-by: Christian Schoenebeck <qemu_oss@crudebyte.com>
> >>> Reviewed-by: Will Cohen <wwcohen@gmail.com>
> >>> ---
> >>> 
> >>>    hw/9pfs/9p-util-darwin.c | 27 ++++++++++++++++++++++++++-
> >>>    1 file changed, 26 insertions(+), 1 deletion(-)
> >>> 
> >>> diff --git a/hw/9pfs/9p-util-darwin.c b/hw/9pfs/9p-util-darwin.c
> >>> index e24d09763a..39308f2a45 100644
> >>> --- a/hw/9pfs/9p-util-darwin.c
> >>> +++ b/hw/9pfs/9p-util-darwin.c
> >>> @@ -74,6 +74,27 @@ int fsetxattrat_nofollow(int dirfd, const char
> >>> *filename, const char *name,>
> >>> 
> >>>     */
> >>>    
> >>>    #if defined CONFIG_PTHREAD_FCHDIR_NP
> >>> 
> >>> +static int create_socket_file_at_cwd(const char *filename, mode_t mode)
> >>> {
> >>> +    int fd, err;
> >>> +    struct sockaddr_un addr = {
> >>> +        .sun_family = AF_UNIX
> >>> +    };
> >>> +
> >>> +    fd = socket(PF_UNIX, SOCK_DGRAM, 0);
> >>> +    if (fd == -1) {
> >>> +        return fd;
> >>> +    }
> >>> +    snprintf(addr.sun_path, sizeof(addr.sun_path), "./%s", filename);
> >> 
> >> It would result in an incorrect path if the path does not fit in
> >> addr.sun_path. It should report an explicit error instead.
> > 
> > Looking at its header file, 'sun_path' is indeed defined on macOS with an
> > oddly small size of only 104 bytes. So yes, I should explicitly handle
> > that
> > error case.
> > 
> > I'll post a v3.
> > 
> >>> +    err = bind(fd, (struct sockaddr *) &addr, sizeof(addr));
> >>> +    if (err == -1) {
> >>> +        goto out;
> >> 
> >> You may close(fd) as soon as bind() returns (before checking the
> >> returned value) and eliminate goto.
> > 
> > Yeah, I thought about that alternative, but found it a bit ugly, and
> > probably also counter-productive in case this function might get extended
> > with more error pathes in future. Not that I would insist on the current
> > solution though.
> 
> I'm happy with the explanation. Thanks.
> 
> >>> +    }
> >>> +    err = chmod(addr.sun_path, mode);
> >> 
> >> I'm not sure if it is fine to have a time window between bind() and
> >> chmod(). Do you have some rationale?
> > 
> > Good question. QEMU's 9p server is multi-threaded; all 9p requests come in
> > serialized and the 9p server controller portion (9p.c) is only running on
> > QEMU main thread, but the actual filesystem driver calls are then
> > dispatched to QEMU worker threads and therefore running concurrently at
> > this point:
> > 
> > https://wiki.qemu.org/Documentation/9p#Threads_and_Coroutines
> > 
> > Similar situation on Linux 9p client side: it handles access to a mounted
> > 9p filesystem concurrently, requests are then serialized by 9p driver on
> > Linux and sent over wire to 9p server (host).
> > 
> > So yes, there might be implications by that short time windows. But could
> > that be exploited on macOS hosts in practice?
> > 
> > The socket file would have mode srwxr-xr-x for a short moment.
> > 
> > For security_model=mapped* this should not be a problem.
> > 
> > For security_model=none|passhrough, in theory, maybe? But how likely is
> > that? If you are using a Linux client for instance, trying to brute-force
> > opening the socket file, the client would send several 9p commands
> > (Twalk, Tgetattr, Topen, probably more). The time window of the two
> > commands above should be much smaller than that and I would expect one of
> > the 9p commands to error out in between.
> > 
> > What would be a viable approach to avoid this issue on macOS?
> 
> It is unlikely that a naive brute-force approach will succeed to
> exploit. The more concerning scenario is that the attacker uses the
> knowledge of the underlying implementation of macOS to cause resource
> contention to widen the window. Whether an exploitation is viable
> depends on how much time you spend digging XNU.
> 
> However, I'm also not sure if it really *has* a race condition. Looking
> at v9fs_co_mknod(), it sequentially calls s->ops->mknod() and
> s->ops->lstat(). It also results in an entity called "path name based
> fid" in the code, which inherently cannot identify a file when it is
> renamed or recreated.
> 
> If there is some rationale it is safe, it may also be applied to the
> sequence of bind() and chmod(). Can anyone explain the sequence of
> s->ops->mknod() and s->ops->lstat() or path name based fid in general?

You are talking about 9p server's controller level: I don't see something that 
would prevent a concurrent open() during this bind() ... chmod() time window 
unfortunately.

Argument 'fidp' passed to function v9fs_co_mknod() reflects the directory in 
which the new device file shall be created. So 'fidp' is not the device file 
here, nor is 'fidp' modified during this function.

Function v9fs_co_mknod() is entered by 9p server on QEMU main thread. At the 
beginning of the function it first acquires a read lock on a (per 9p export) 
global coroutine mutex:

    v9fs_path_read_lock(s);

and holds this lock until returning from function v9fs_co_mknod(). But that's 
just a read lock. Function v9fs_co_open() also just gains a read lock. So they 
can happen concurrently.

Then v9fs_co_run_in_worker({...}) is called to dispatch and execute all the 
code block (think of it as an Obj-C "block") inside this (macro actually) on a 
QEMU worker thread. So an arbitrary background thread would then call the fs 
driver functions:

    s->ops->mknod()
    v9fs_name_to_path()
    s->ops->lstat()

and then at the end of the code block the background thread would dispatch 
back to QEMU main thread. So when we are reaching:

    v9fs_path_unlock(s);

we are already back on QEMU main thread, hence unlocking on main thread now 
and finally leaving function v9fs_co_mknod().

The important thing to understand is, while that

    v9fs_co_run_in_worker({...})

code block is executed on a QEMU worker thread, the QEMU main thread (9p 
server controller portion, i.e. 9p.c) is *not* sleeping, QEMU main thread 
rather continues to process other (if any) client requests in the meantime. In 
other words v9fs_co_run_in_worker() neither behaves exactly like Apple's GCD 
dispatch_async(), nor like dispatch_sync(), as GCD is not coroutine based.

So 9p server might pull a pending 'Topen' client request from the input FIFO 
in the meantime and likewise dispatch that to a worker thread, etc. Hence a 
concurrent open() might in theory be possible, but I find it quite unlikely to 
succeed in practice as the open() call on guest is translated by Linux client 
into a bunch of synchronous 9p requests on the path passed with the open() 
call on guest, and a round trip for each 9p message is like what, ~0.3ms or 
something in this order. That's quite huge compared to the time window I would 
expect between bind() ... open().

Does this answer your questions?

> Regards,
> Akihiko Odaki
> 
> >>> +out:
> >>> +    close(fd);
> >>> +    return err;
> >>> +}
> >>> +
> >>> 
> >>>    int qemu_mknodat(int dirfd, const char *filename, mode_t mode, dev_t
> >>>    dev)
> >>>    {
> >>>    
> >>>        int preserved_errno, err;
> >>> 
> >>> @@ -93,7 +114,11 @@ int qemu_mknodat(int dirfd, const char *filename,
> >>> mode_t mode, dev_t dev)>
> >>> 
> >>>        if (pthread_fchdir_np(dirfd) < 0) {
> >>>        
> >>>            return -1;
> >>>        
> >>>        }
> >>> 
> >>> -    err = mknod(filename, mode, dev);
> >>> +    if (S_ISSOCK(mode)) {
> >>> +        err = create_socket_file_at_cwd(filename, mode);
> >>> +    } else {
> >>> +        err = mknod(filename, mode, dev);
> >>> +    }
> >>> 
> >>>        preserved_errno = errno;
> >>>        /* Stop using the thread-local cwd */
> >>>        pthread_fchdir_np(-1);