linux-man.vger.kernel.org archive mirror
From: Alejandro Colomar <alx.manpages@gmail.com>
To: Colin Watson <cjwatson@debian.org>, Sam James <sam@gentoo.org>
Cc: Alexis <flexibeast@gmail.com>,
	groff@gnu.org, linux-man <linux-man@vger.kernel.org>,
	Ingo Schwarze <schwarze@usta.de>,
	Ralph Corderoy <ralph@inputplus.co.uk>,
	Dirk Gouders <dirk@gouders.net>
Subject: Re: Compressed man pages (was: Accessibility of man pages (was: Playground pager lsp(1)))
Date: Sun, 9 Apr 2023 15:36:05 +0200
Message-ID: <bddac44c-4495-4323-4051-e8ec083b62af@gmail.com>
In-Reply-To: <ZDKvl/7YgzpZ8Bix@riva.ucam.org>



On 4/9/23 14:29, Colin Watson wrote:
> On Sun, Apr 09, 2023 at 02:05:08PM +0200, Alejandro Colomar wrote:
>> Important note: Sam, are you sure you want your pages compressed
>> with bz2?  Have you seen the 10 seconds it takes man-db's man(1) to
>> find a word in the pages?  I suggest that you at least try to
>> reproduce these tests on your machine, and see whether it's just me
>> or man-db's man(1) is pretty bad at non-gz pages.
> 
> man-db is significantly slower with bzip2 than gzip these days, because
> much of the performance work I did in 2.10.0 only applies to gzip:
> there's in-process support for decompressing gzip, but we use
> subprocesses for bzip2.  IMO the relatively small difference in
> compressed size doesn't justify the effort of building in-process
> support for multiple compression algorithms.

Agree.

>  I recommend that
> distributions just use gzip;

I don't agree here.  gzip vs man source is 5M vs 9M.  However, a
simple pipeline searching for a word in gzip pages takes ~114x the
time it takes to perform the same search on man(7) source.  I don't
think that small benefit in size justifies the slowness.

Of course, this is only about theoretical maximum performance.
Current man(1) has other issues so it doesn't benefit from this
performance advantage.


> but if distributions _really_ want to use
> something else for whatever reason, then perhaps they should contribute
> code to man-db to ensure similar performance to gzip.  I'm happy to give
> pointers if there's a sufficiently compelling reason to make it worth
> the effort.
> 
>> -  man-db's man(1) is slower with plain man(7) source than with .gz
>>    pages for some mysterious reason.
> 
> Maybe CPU is sufficiently cheaper than I/O that the fact of reading less
> data from disk dominates.

My CPU is powerful, but so is my SSD.  I wouldn't expect decompressing
to be faster than I/O.  I have a Samsung 960 PRO, which is quite fast[1].

$ lscpu
[...]
  Model name:            Intel(R) Core(TM) i7-5775C CPU @ 3.30GHz
    CPU family:          6
    Model:               71
    Thread(s) per core:  1
    Core(s) per socket:  4
    Socket(s):           1
    Stepping:            1
    CPU(s) scaling MHz:  44%
    CPU max MHz:         3700.0000
    CPU min MHz:         800.0000
[...]
Caches (sum of all):     
  L1d:                   128 KiB (4 instances)
  L1i:                   128 KiB (4 instances)
  L2:                    1 MiB (4 instances)
  L3:                    6 MiB (1 instance)
  L4:                    128 MiB (1 instance)
[...]

$ lspci | grep -i samsung
01:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM961/PM961/SM963

$ lsblk -o NAME,FSTYPE,MOUNTPOINT,SIZE,MODEL
NAME                                FSTYPE   MOUNTPOINT              SIZE MODEL
[...]
nvme0n1                                                            953.9G Samsung SSD 960 PRO
├─nvme0n1p1                         vfat     /boot/efi              1023M 
├─nvme0n1p2                         ext4     /boot                     4G 
└─nvme0n1p3                         crypto_L                         948G 
  └─nvme0n1p3_crypt                 ext4     /                       948G


Also, if I/O were the bottleneck, a manual loop should have similar
problems, but it doesn't; if I loop manually over the files and grep
them, it takes 0.01 s, which is the smallest time that /bin/time can
measure on my system.
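
For reference, the kind of manual loop meant here can be sketched as
follows.  This is only an illustration, not the exact commands timed in
this message: the function name search_pages is made up, and it assumes
that gzip(1) and bzip2(1) are installed and that the file suffix
identifies the compression.

```shell
#!/bin/sh
# Hypothetical helper (not part of man-db or the Linux man-pages):
# print every regular file under directory $1 that contains the
# fixed string $2.  Compressed pages are decompressed by a subprocess,
# one per file, which is the expensive part of such a loop.
search_pages() {
    dir=$1
    word=$2
    find "$dir" -type f | sort | while IFS= read -r f; do
        case $f in
        *.gz)  gzip -dc <"$f" ;;
        *.bz2) bzip2 -dc <"$f" ;;
        *)     cat <"$f" ;;
        esac | grep -q -F -e "$word" && printf '%s\n' "$f"
    done
}
```

It could be timed in the same way as the other commands, for example
with /bin/time -f %e sh -c '. ./search_pages.sh; search_pages "$MANPATH" RLIMIT_NOFILE | wc -l',
where search_pages.sh is an assumed file name for the sketch above.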


I repeated the tests on a tmpfs just to check.  The times are almost the
same (except that bzip2 goes down from 10 s to 9 s :).


$ mount | grep /tmp
tmpfs on /tmp type tmpfs (rw,noatime,inode64)
$ sudo rm -r /tmp/man
$ sudo make install-man prefix=/tmp/man/gz_ -j LINK_PAGES=symlink Z=.gz | wc -l
2570
$ sudo make install-man prefix=/tmp/man/bz2 -j LINK_PAGES=symlink Z=.bz2 | wc -l
2570
$ sudo make install-man prefix=/tmp/man/man -j LINK_PAGES=symlink Z= | wc -l
2570
$ du -sh /tmp/man/*
5.3M	/tmp/man/bz2
5.4M	/tmp/man/gz_
9.3M	/tmp/man/man


$ export MANPATH=/tmp/man/gz_/share/man
$ /bin/time -f %e dash -c "man -Kaw RLIMIT_NOFILE | wc -l"
37
0.30
$ /bin/time -f %e dash -c "find $MANPATH -type f | while read f; do gzip -d - <\$f | grep -l RLIMIT_NOFILE >/dev/null && echo \$f; done | wc -l"
17
1.14


man(1) is quite optimized here.  I can't beat it with a shell pipeline
for .gz pages.  :)


$ export MANPATH=/tmp/man/bz2/share/man
$ /bin/time -f %e dash -c "man -Kaw RLIMIT_NOFILE | wc -l"
37
9.22
$ /bin/time -f %e dash -c "find $MANPATH -type f | while read f; do bzip2 -d - <\$f | grep -l RLIMIT_NOFILE >/dev/null && echo \$f; done | wc -l"
17
1.22


Sam, really consider not using .bz2 for Gentoo's pages.  :)


$ export MANPATH=/tmp/man/man/share/man
$ /bin/time -f %e dash -c "man -Kaw RLIMIT_NOFILE | wc -l"
37
0.52
$ /bin/time -f %e dash -c "find $MANPATH -type f | xargs grep -l RLIMIT_NOFILE | wc -l"
17
0.01


man(1) is ~52x slower than my loop.  Similar results from RAM and NVMe,
so I/O is not the issue here.
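
The ratios quoted in this message are plain quotients of the elapsed
times shown above.  A small sketch of the arithmetic, assuming awk(1) is
available (the helper name ratio is hypothetical):

```shell
#!/bin/sh
# Hypothetical helper: print how many times larger $1 is than $2,
# rounded to the nearest integer.
ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.0f\n", a / b }'
}

ratio 1.14 0.01   # gzip loop vs. grep on plain source: prints 114
ratio 0.52 0.01   # man(1) vs. grep on plain source:    prints 52
ratio 9.22 0.30   # man(1) on .bz2 vs. man(1) on .gz:   prints 31
```

Since 0.01 s is the smallest value /bin/time reports on this system, the
real time of the plain grep may be lower than that, in which case the
first two ratios are lower bounds.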

> 
> 
> (Can I request that any concrete actions that need to be taken based on
> this thread be split out to separate bug reports or something, please?
> This thread is long and I don't really want to have lots of meandering
> discourse in my inbox going back over the tired old man vs. info debate
> or whatever, but if there are actual things I need to fix in man-db then
> I'd rather not miss them.)

Sure; do you have a mailing list, or should I send them to you and CC
linux-man@?  I have at least one bug report for you.

Cheers,
Alex

[1]:  <https://www.anandtech.com/show/10754/samsung-960-pro-ssd-review>

-- 
<http://www.alejandro-colomar.es/>
GPG key fingerprint: A9348594CE31283A826FBDD8D57633D441E25BB5
