All of lore.kernel.org
 help / color / mirror / Atom feed
From: Eric Dumazet <edumazet@google.com>
To: Paolo Abeni <pabeni@redhat.com>
Cc: netdev@vger.kernel.org, Steffen Froemer <sfroemer@redhat.com>
Subject: Re: [PATCH net] tcp: fix page frag corruption on page fault
Date: Fri, 26 Nov 2021 06:18:49 -0800	[thread overview]
Message-ID: <CANn89iJdg0qFvnykrtGx5OrV3zQTEtm2htTOFtaK-nNwNmOmDA@mail.gmail.com> (raw)
In-Reply-To: <d77eb546e29ce24be9ab88c47a66b70c52ce8109.1637923678.git.pabeni@redhat.com>

On Fri, Nov 26, 2021 at 4:00 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> Steffen reported a TCP stream corruption for HTTP requests
> served by the apache web-server using a cifs mount-point
> and memory mapping the relevant file.
>
> The root cause is quite similar to the one addressed by
> commit 20eb4f29b602 ("net: fix sk_page_frag() recursion from
> memory reclaim"). Here the nested access to the task page frag
> is caused by a page fault on the (mmapped) user-space memory
> buffer coming from the cifs file.
>
> The page fault handler performs an smb transaction on a different
> socket, inside the same process context. Since sk->sk_allaction
> for such socket does not prevent the usage for the task_frag,
> the nested allocation modify "under the hood" the page frag
> in use by the outer sendmsg call, corrupting the stream.
>
> The overall relevant stack trace looks like the following:
>
> httpd 78268 [001] 3461630.850950:      probe:tcp_sendmsg_locked:
>         ffffffff91461d91 tcp_sendmsg_locked+0x1
>         ffffffff91462b57 tcp_sendmsg+0x27
>         ffffffff9139814e sock_sendmsg+0x3e
>         ffffffffc06dfe1d smb_send_kvec+0x28
>         [...]
>         ffffffffc06cfaf8 cifs_readpages+0x213
>         ffffffff90e83c4b read_pages+0x6b
>         ffffffff90e83f31 __do_page_cache_readahead+0x1c1
>         ffffffff90e79e98 filemap_fault+0x788
>         ffffffff90eb0458 __do_fault+0x38
>         ffffffff90eb5280 do_fault+0x1a0
>         ffffffff90eb7c84 __handle_mm_fault+0x4d4
>         ffffffff90eb8093 handle_mm_fault+0xc3
>         ffffffff90c74f6d __do_page_fault+0x1ed
>         ffffffff90c75277 do_page_fault+0x37
>         ffffffff9160111e page_fault+0x1e
>         ffffffff9109e7b5 copyin+0x25
>         ffffffff9109eb40 _copy_from_iter_full+0xe0
>         ffffffff91462370 tcp_sendmsg_locked+0x5e0
>         ffffffff91462370 tcp_sendmsg_locked+0x5e0
>         ffffffff91462b57 tcp_sendmsg+0x27
>         ffffffff9139815c sock_sendmsg+0x4c
>         ffffffff913981f7 sock_write_iter+0x97
>         ffffffff90f2cc56 do_iter_readv_writev+0x156
>         ffffffff90f2dff0 do_iter_write+0x80
>         ffffffff90f2e1c3 vfs_writev+0xa3
>         ffffffff90f2e27c do_writev+0x5c
>         ffffffff90c042bb do_syscall_64+0x5b
>         ffffffff916000ad entry_SYSCALL_64_after_hwframe+0x65
>
> A possible solution would be adding the __GFP_MEMALLOC flag
> to the cifs allocation. That looks dangerous, as the memory
> allocated by the cifs fs will not be free soon and such
> allocation will not allow for more memory to be freed.
>
> Instead, this patch changes the tcp_sendmsg() code to avoid
> touching the page frag after performing the copy from the
> user-space buffer. Any page fault or memory reclaim operation
> there is now free to touch the task page fragment without
> corrupting the state used by the outer sendmsg().
>
> As a downside, if the user-space copy fails, there will be
> some additional atomic operations due to the reference counting
> on the faulty fragment, but that looks acceptable for a slow
> error path.
>
> Reported-by: Steffen Froemer <sfroemer@redhat.com>
> Fixes: 5640f7685831 ("net: use a per task frag allocator")
> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> ---
>  net/ipv4/tcp.c | 20 ++++++++++++--------
>  1 file changed, 12 insertions(+), 8 deletions(-)
>
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index bbb3d39c69af..2d85636c1577 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -1304,6 +1304,7 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
>                         bool merge = true;
>                         int i = skb_shinfo(skb)->nr_frags;
>                         struct page_frag *pfrag = sk_page_frag(sk);
> +                       unsigned int offset;
>
>                         if (!sk_page_frag_refill(sk, pfrag))
>                                 goto wait_for_space;
> @@ -1331,14 +1332,11 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
>                         if (!sk_wmem_schedule(sk, copy))
>                                 goto wait_for_space;
>
> -                       err = skb_copy_to_page_nocache(sk, &msg->msg_iter, skb,
> -                                                      pfrag->page,
> -                                                      pfrag->offset,
> -                                                      copy);
> -                       if (err)
> -                               goto do_error;
> -
> -                       /* Update the skb. */
> +                       /* Update the skb before accessing the user space buffer
> +                        * so that we leave the task frag in a consistent state.
> +                        * Just in case the page_fault handler need to use it
> +                        */
> +                       offset = pfrag->offset;
>                         if (merge) {
>                                 skb_frag_size_add(&skb_shinfo(skb)->frags[i - 1], copy);
>                         } else {
> @@ -1347,6 +1345,12 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
>                                 page_ref_inc(pfrag->page);
>                         }
>                         pfrag->offset += copy;
> +
> +                       err = skb_copy_to_page_nocache(sk, &msg->msg_iter, skb,
> +                                                      pfrag->page,
> +                                                      offset, copy);
> +                       if (err)
> +                               goto do_error;
>                 } else {
>                         /* First append to a fragless skb builds initial
>                          * pure zerocopy skb
> --
> 2.33.1
>

This patch is completely wrong, you just horribly broke TCP.

Please investigate CIFS and gfpflags_normal_context() tandem to fix
this issue instead.

CIFS needs to make sure TCP will use the private socket sk->sk_frag

  reply	other threads:[~2021-11-26 14:46 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-11-26 12:00 [PATCH net] tcp: fix page frag corruption on page fault Paolo Abeni
2021-11-26 14:18 ` Eric Dumazet [this message]
2021-11-26 15:05   ` Paolo Abeni
2021-11-26 15:27     ` Eric Dumazet
2021-11-26 15:32       ` Eric Dumazet
2021-11-26 16:04         ` Paolo Abeni

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CANn89iJdg0qFvnykrtGx5OrV3zQTEtm2htTOFtaK-nNwNmOmDA@mail.gmail.com \
    --to=edumazet@google.com \
    --cc=netdev@vger.kernel.org \
    --cc=pabeni@redhat.com \
    --cc=sfroemer@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.