From: David Howells
To: Hugh Dickins, bryan.wu@analog.com
Cc: Robin Holt, "Kawai, Hidehiro", Andrew Morton, kernel list, Pavel Machek, Alan Cox, Masami Hiramatsu, sugita, Satoshi OSHIMA, haoki@redhat.com, Robin Getz
Subject: Move to unshared VMAs in NOMMU mode?
Date: Fri, 09 Mar 2007 14:12:02 +0000
Message-ID: <12852.1173449522@redhat.com>
In-Reply-To: <3378.1173204813@redhat.com>
References: <3378.1173204813@redhat.com> <20070216165042.GB409@lnx-holt.americas.sgi.com> <45D5B483.3020502@hitachi.com> <45D5B2E3.3030607@hitachi.com> <20368.1171638335@redhat.com> <18817.1171656543@redhat.com> <29317.1172931029@redhat.com>

I've been considering how to deal with the SYSV SHM problem, and I think we
may have to move to unshared VMAs in NOMMU mode to solve it.

Currently, each mm_struct carries in its arch-specific context a list of
VMLs.  Take the FRV context for example:

	[include/asm-frv/mmu.h]
	typedef struct {
	#ifdef CONFIG_MMU
		...
	#else
		struct vm_list_struct	*vmlist;
		unsigned long		end_brk;
	#endif
		...
	} mm_context_t;

Each VML struct contains a pointer to a systemwide VMA and to the next VML
in the list:

	struct vm_list_struct {
		struct vm_list_struct	*next;
		struct vm_area_struct	*vma;
	};

The VMAs themselves are kept in an rb-tree in mm/nommu.c:

	/* list of shareable VMAs */
	struct rb_root nommu_vma_tree = RB_ROOT;

which can then be displayed through /proc/maps.

There are some restrictions on this system, mainly due to the NOMMU
constraints:

 (*) mmap() may not be used to overlay one mapping upon another.

 (*) mmap() may not be used with MAP_FIXED.

 (*) mmap()s of the same part of the same file will result in multiple
     mappings returning the same base address, assuming the maps are
     shareable.  If they aren't shareable, they'll be at different base
     addresses.

 (*) For normal shareable file mappings, two mappings will only be shared if
     they precisely match in offset, size and protection; otherwise a new
     mapping will be created (this is because the VMAs themselves are
     shared).  Splitting VMAs would relax this restriction: though
     subsequent mappings would have to be bounded by the first mapping,
     they wouldn't have to be the same size.

 (*) munmap() may only unmap a precise match amongst the mappings made; it
     may not be used to cut down or punch a hole in an existing mapping.

The VMAs for private file mappings, private blockdev mappings and anonymous
mappings, be they shared[*] or unshared, hold a pointer to the kmalloc()'d
region of memory in which the mapping contents reside.  This region is
discarded when the VMA is deleted.  When a region can be shared, the VMA is
also shared, and so no reference counting need take place on the mapping
contents as that is implied by the VMA.

 [*] MAP_PRIVATE + !PROT_WRITE + !PT_PTRACED regions may be shared.

Note that for mappable chardevs with special BDI capability flags, extra
VMAs may be allocated because (a) they may need to overlap non-exactly, and
(b) the chardev itself pins the backing storage, if the backing storage is
potentially transient.
If VMAs are not shared for shared memory regions, then some other means of
retaining the actual allocated memory region must be found.  The obvious way
to do this is to have each VMA point to a shared, refcounted record that
keeps track of the region:

	struct vm_region {
		/* the first parameters define the region as for the VMA */
		pgprot_t		vm_page_prot;
		unsigned long		vm_start;
		unsigned long		vm_end;
		unsigned long		vm_pgoff;
		struct file		*vm_file;

		atomic_t		vm_usage;	/* region usage count */
		struct rb_node		vm_rb;		/* region tree */
	};

The VMA itself would then have to be modified to include a pointer to this,
but wouldn't then need its own refcount.  VMAs would once again belong to
the mm_struct, the VML struct would vanish, and the VML list rooted in
mm_context_t would vanish with it.

For R/O shareable file mappings, it might be possible to actually use the
target file's pagecache for the mapping.  I do something of that sort for
shared-writable mappings on ramfs files (to support POSIX SHM and SYSV SHM).

The downside of allocating all these extra VMAs is, of course, that it
takes up more memory, though that may not be too bad, especially if it
comes at the gain of additional consistency with the MM code.  However,
consistency isn't for the most part a real issue.  As I see it, drivers and
filesystems should not concern themselves with anything other than the VMA
they're given, and so it doesn't matter to them whether VMAs are shared or
not.

That brings us to the problem with SYSV SHM, which keeps an attachment
count that the VMA mmap(), open() and release() ops manipulate.  Because
VMAs are shared, the nattch count comes out wrong on NOMMU systems.  Note
that on MMU systems, doing a munmap() in the middle of an attached region
will *also* break the nattch count, though there it is self-correcting.

Another way of dealing with the nattch count on NOMMU systems would be to
do it through the VML list, but that then needs more special-casing in the
SHM driver and perhaps others.

Thoughts?

David