From mboxrd@z Thu Jan 1 00:00:00 1970
From: Yann Droneaud
Subject: Re: CVE-2014-8159 kernel: infiniband: uverbs: unprotected physical memory access
Date: Mon, 13 Apr 2015 15:29:41 +0200
Message-ID: <1428931781.22575.232.camel@opteya.com>
References: <1427969085.17020.5.camel@opteya.com>
 <1427981431.22575.21.camel@opteya.com>
 <551D5DC8.6070909@mellanox.com>
 <1427992506.22575.80.camel@opteya.com>
 <1427998401240.52348@mellanox.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Return-path:
In-Reply-To: <1427998401240.52348-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Sender: linux-rdma-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
To: Haggai Eran
Cc: Shachar Raindel, Sagi Grimberg,
 linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
 linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
List-Id: linux-rdma@vger.kernel.org

Hi,

On Thursday, 2 April 2015 at 18:12 +0000, Haggai Eran wrote:
> On Thursday, April 2, 2015 7:44 PM, Shachar Raindel wrote:
> >> -----Original Message-----
> >> From: Yann Droneaud [mailto:ydroneaud-RlY5vtjFyJ3QT0dZR+AlfA@public.gmane.org]
> >> On Thursday, 2 April 2015 at 18:18 +0300, Haggai Eran wrote:
> >> > On 02/04/2015 16:30, Yann Droneaud wrote:
> >> >> On Thursday, 2 April 2015 at 10:52 +0000, Shachar Raindel wrote:
> >> >>>> -----Original Message-----
> >> >>>> From: Yann Droneaud [mailto:ydroneaud-RlY5vtjFyJ3QT0dZR+AlfA@public.gmane.org]
> >> >>>> Sent: Thursday, April 02, 2015 1:05 PM
> >> >>>> On Wednesday, 18 March 2015 at 17:39 +0000, Shachar Raindel wrote:
> >> >>
> >> >>>>> +	/*
> >> >>>>> +	 * If the combination of the addr and size requested for this
> >> >>>> memory
> >> >>>>> +	 * region causes an integer overflow, return error.
> >> >>>>> +	 */
> >> >>>>> +	if ((PAGE_ALIGN(addr + size) <= size) ||
> >> >>>>> +	    (PAGE_ALIGN(addr + size) <= addr))
> >> >>>>> +		return ERR_PTR(-EINVAL);
> >> >>>>> +
> >> >>>>
> >> >>>> Can access_ok() be used here ?
> >> >>>>
> >> >>>>     if (!access_ok(writable ? VERIFY_WRITE : VERIFY_READ,
> >> >>>>                    addr, size))
> >> >>>>          return ERR_PTR(-EINVAL);
> >> >>>>
> >> >>>
> >> >>> No, this will break the current ODP semantics.
> >> >>>
> >> >>> ODP allows the user to register memory that is not accessible yet.
> >> >>> This is a critical design feature, as it allows avoiding holding
> >> >>> a registration cache. Adding this check will break the behavior,
> >> >>> forcing memory to be all accessible when registering an ODP MR.
> >> >>>
> >> >>
> >> >> Where's the check for the range being in userspace memory space,
> >> >> especially for the ODP case ?
> >> >>
> >> >> For the non-ODP case (e.g. plain old behavior), does get_user_pages()
> >> >> ensure the requested pages fit in the userspace region on all
> >> >> architectures ? I think so.
> >> >
> >> > Yes, get_user_pages will return a smaller amount of pages than requested
> >> > if it encounters an unmapped region (or a region without write
> >> > permissions for write requests). If this happens, the loop in
> >> > ib_umem_get calls get_user_pages again with the next set of pages, and
> >> > this time if the first page still cannot be mapped an error is returned.
> >> >
> >> >>
> >> >> In the ODP case, I'm not sure such a check is ever done ?
> >> >
> >> > In ODP, we also call get_user_pages, but only when a page fault occurs
> >> > (see ib_umem_odp_map_dma_pages()). This allows the user to pre-register
> >> > a memory region that contains unmapped virtual space, and then mmap
> >> > different files into that area without needing to re-register.
> >> >
> >>
> >> OK, thanks for the description.
> >>
> >> ...
> >>
> >> Another related question: as the large memory range could be registered
> >> by user space with ibv_reg_mr(pd, base, size, IB_ACCESS_ON_DEMAND),
> >> what prevents the kernel from mapping a file, as the result of
> >> mmap(0, ...), in this region, making it available remotely through
> >> IBV_WR_RDMA_READ / IBV_WR_RDMA_WRITE ?
> >>
>
> This is not a bug. This is a feature.
>
> Exposing a file through RDMA, using ODP, can be done exactly like this.
> Given that the application explicitly requested this behavior, I don't
> see why it is a problem. Actually, some of our tests use such flows.
> The mmu notifiers mechanism allows us to do this safely. When the page is
> written back to disk, it is removed from the ODP mapping. When it is
> accessed by the HCA, it is brought back to RAM.
>
>
> I want to add that we would like to see users registering a very large
> memory region (perhaps the entire process address space) for local
> access, and then enabling remote access only to specific regions using
> memory windows. However, this isn't supported yet by our driver.

In such a scheme, the registration must still be handled "manually":
one has to create a memory window to get an rkey to be exchanged with a
peer, so why would one want to register such a large memory region (the
whole process space) ?

I guess creating the memory window is faster than registering a memory
region.

I'd rather say this is not an excuse to register a larger memory region
(up to the whole process space, current and future), as it sounds like a
surprising optimisation: letting the HCA know about too many pages just
to be sure it already knows some when the process wants to use them.
It seems it would become difficult to handle if there are too many
processes.

I would prefer creating a memory region to become costless (through ODP :).

> Still, there are valid cases where you would still want the results
> of an mmap(0,...)
> call to be remotely accessible, in cases where there
> is enough trust between the local process and the remote process.

mmap(0, ...., fd) lets the kernel choose where to map the file in the
process virtual memory space, so it may, may not, or may only partially
end up in an ODP pre-registered memory region, in a range that is still
unallocated/unused. I don't think one wants that to happen.

> It may help a middleware communication library register a large
> portion of the address space in advance, and still work with random
> pointers given to it by another application module.
>

But as said at the beginning of your message, the middleware would have
to bind a memory window before posting a work request / exposing an
rkey for the "random pointers".

So I fail to understand how ODP could be used when it comes to
registering a memory region not yet backed by anything.

Regards.

--
Yann Droneaud
OPTEYA

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html