From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Thu, 12 Oct 2017 01:06:15 -0700
From: Andrei Vagin
To: Alexey Dobriyan
Cc: akpm@linux-foundation.org, linux-kernel@vger.kernel.org,
    linux-api@vger.kernel.org, rdunlap@infradead.org, tglx@linutronix.de,
    tixxdz@gmail.com, gladkov.alexey@gmail.com
Subject: Re: [1/2,v2] fdmap(2)
Message-ID: <20171012080608.GA23077@outlook.office365.com>
References: <20170924200620.GA24368@avx2>
    <20171010220804.GA30735@outlook.office365.com>
    <20171011181234.GB2119@avx2>
In-Reply-To: <20171011181234.GB2119@avx2>
MIME-Version: 1.0
Content-Type: text/plain; charset=koi8-r
Content-Disposition: inline
User-Agent: Mutt/1.8.3 (2017-05-23)
Sender: linux-kernel-owner@vger.kernel.org
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, Oct 11, 2017 at 09:12:34PM +0300, Alexey Dobriyan wrote:
> On Tue, Oct 10, 2017 at 03:08:06PM -0700, Andrei Vagin wrote:
> > On Sun, Sep 24, 2017 at 11:06:20PM +0300, Alexey Dobriyan wrote:
> > > From: Aliaksandr Patseyenak
> > >
> > > Implement system call for bulk retrieving of opened descriptors
> > > in binary form.
> > >
> > > Some daemons could use it to reliably close file descriptors
> > > before starting.
> > > Currently they close everything up to some number
> > > which formally is not reliable. Other natural users are lsof(1) and CRIU
> > > (although lsof does so much in /proc that the effect is thoroughly buried).
> >
> > Hello Alexey,
> >
> > I am not sure about the idea to add syscalls for all sort of process
> > attributes. For example, in CRIU we need file descriptors with their
> > properties, which we currently get from /proc/pid/fdinfo/. How can
> > this interface be extended to achieve our goal?
> >
> > Have you seen the task-diag interface what I sent about a year ago?
>
> Of course, let's discuss /proc/task_diag.
>
> Adding it as /proc file is obviously unnecessary: you do it only
> to hook ->read and ->write netlink style
> (and BTW you don't need .THIS_MODULE anymore ;-)
>
> Transactional netlink send and recv aren't necessary either.
> As I understand it, it comes from old times when netlink was async,
> so 2 syscalls were necessary. Netlink is not async anymore.
>
> Basically you want to do sys_task_diag(2) which accepts set of pids
> (maybe) and a mask (see statx()) and returns synchronously result into
> a buffer.

You are not quite right here. We send a request and then we read a
response, which can be bigger than what we can read in one call. So we
need something like a cursor; in your case it is the "start" argument.
But sometimes this cursor contains kernel-internal data for better
performance. We need a way to address this cursor from userspace, and
that is the reason why we need a file descriptor in this scheme. For
example, you can look at the proc_maps_private structure.

> > We had a discussion on the previous kernel summit how to rework
> > task-diag, so that it can be merged into the upstream kernel.
> > Unfortunately, I didn't send a summary for this discussion. But it's
> > better now than never. We decided to do something like this:
> >
> > 1. Add a new syscall readfile(fname, buf, size), which can be
> > used to read small files without opening a file descriptor. It will be
> > useful for proc files, configs, etc.
>
> If nothing, it should be done because the number of programmers capable
> of writing readfile() in userspace correctly handling all errors and
> short reads is very small indeed. Out of curiosity I once booted a kernel
> which made all reads short by default. It was fascinating I can tell you.
>
> > 2. bin/text/bin conversion is very slow
> >  - 65.47% proc_pid_status
> >     - 20.81% render_sigset_t
> >        - 18.27% seq_printf
> >           + 15.77% seq_vprintf
> >     - 10.65% task_mem
> >        + 8.78% seq_print
> >        + 1.02% hugetlb_rep
> >     + 7.40% seq_printf
> > so a new interface has to use a binary format and the format of netlink
> > messages can be used here. It should be possible to extend a file
> > without breaking backward compatibility.
>
> Binary -- yes.
> netlink attributes -- maybe.
>
> There is statx() model which is perfect for this usecase:
> do not want pagecache of all block devices? sure, no problem.
>
> > 3. There are a lot of objection to use a netlink sockets out of the network
> > subsystem. The idea of using a "transaction" file looks weird for many
> > people, so we decided to add a few files in /proc/pid/. I see
> > minimum two files. One file contains information about a task, it is
> > mostly what we have in /proc/pid/status and /proc/pid/stat. Another file
> > describes a task memory, it is what we have now in /proc/pid/smaps.
> > Here is one more major idea. All attributes in a file has to be equal in
> > term of performance, or by other words there should not be attributes,
> > which significantly affect a generation time of a whole file.
> >
> > If we look at /proc/pid/smaps, we spend a lot of time to get memory
> > statistics. This file contains a lot of data and if you read it to get
> > VmFlags, the kernel will waste your time by generating a useless data
> > for you.
>
> There is an unsolvable problem with /proc/*/stat style files. Anyone
> who wants to add new stuff has a decision to make, whether add new /proc
> file or extend existing /proc file.
>
> Adding new /proc file means 3 syscalls currently, it surely will become
> better with aforementioned readfileat() but even adding tons of symlinks
> like this:
>
> 	$ readlink /proc/self/affinity
> 	0f
>
> would have been better -- readlink doesn't open files.
>
> Adding to existing file means _all_ users have to eat the cost as
> read(2) doesn't accept any sort of mask to filter data. Most /proc files
> are seqfiles now which most of the time internally generates whole buffer
> before shipping data to userspace. cat(1) does 32KB read by default
> which is bigger than most of files in /proc and stat'ing /proc files is
> useless because they're all 0 length. Reliable rewinding to necessary data
> is possible only with memchr() which misses the point.
>
> Basically, those sacred text files the Universe consists of suck.
>
> With statx() model the cost of extending result with new data is very
> small -- 1 branch to skip generation of data.
>
> I suggest that anyone who dares to improve the situation with process
> statistics and anything /proc related uses it as a model.
>
> Of course, I also suggest to freeze /proc for new stuff to press
> the issue but one can only dream.

I agree with your points, but I think you chose the wrong set of data
to make an example of the new approach.

You are talking a lot about statx, but for me it is unclear how fdmap
follows the idea of statx. Let's imagine that I want to extend fdmap to
return mnt_id for each file descriptor. Or take a more complex case,
where we decide to provide all data from /proc/pid/fdinfo/X for each
descriptor. The set of fields in fdinfo depends on the type of the file
descriptor; it is different for epoll, signalfd, inotify, sockets, etc.
For inotify file descriptors, there is information about all watches, so
it is not possible to use a fixed-size structure to present this data.

I like the interface of statx, but this case is more complex.

Thanks,
Andrei
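
To make the fixed-size concern concrete, here is a minimal userspace
sketch (Python, not kernel code) of a per-descriptor record built from
length-prefixed attributes, loosely modelled on the netlink attribute
layout (16-bit length including the header, 16-bit type, 4-byte padding).
The attribute names and numbers, the little-endian byte order and the
helper functions are all assumptions made up for this illustration; none
of them is part of fdmap(2) or task-diag as posted. The point is only
that a repeated attribute can carry any number of inotify watches, and
that a reader can step over types it does not know.

```python
import struct

# Hypothetical attribute types, invented for this sketch only.
FD_ATTR_FD = 1          # u32: descriptor number
FD_ATTR_MNT_ID = 2      # u32: mount ID
FD_ATTR_INOTIFY_WD = 3  # u32: one inotify watch descriptor, may repeat

# Header: (length including header, type). Little-endian is an assumption.
_HDR = struct.Struct("<HH")


def _align4(n):
    """Round n up to the next multiple of 4."""
    return (n + 3) & ~3


def put_attr(buf, attr_type, payload):
    """Append one length-prefixed attribute, padded to a 4-byte boundary."""
    length = _HDR.size + len(payload)
    buf.extend(_HDR.pack(length, attr_type))
    buf.extend(payload)
    buf.extend(b"\0" * (_align4(length) - length))


def encode_fd(fd, mnt_id, watches=()):
    """Encode one descriptor; the record grows with the number of watches."""
    buf = bytearray()
    put_attr(buf, FD_ATTR_FD, struct.pack("<I", fd))
    put_attr(buf, FD_ATTR_MNT_ID, struct.pack("<I", mnt_id))
    for wd in watches:
        put_attr(buf, FD_ATTR_INOTIFY_WD, struct.pack("<I", wd))
    return bytes(buf)


def parse_attrs(data):
    """Yield (type, payload) pairs; callers ignore types they do not know."""
    off = 0
    while off + _HDR.size <= len(data):
        length, attr_type = _HDR.unpack_from(data, off)
        if length < _HDR.size or off + length > len(data):
            raise ValueError("truncated or corrupt attribute")
        yield attr_type, data[off + _HDR.size:off + length]
        off += _align4(length)


if __name__ == "__main__":
    record = encode_fd(5, 23, watches=[1, 2, 3])
    for attr_type, payload in parse_attrs(record):
        print(attr_type, struct.unpack("<I", payload)[0])
```

A regular file and an inotify descriptor produce records of different
sizes here, which is exactly what a single fixed struct selected by a
statx-like mask cannot express without a variable-length tail.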