[PATCH v5 0/3]: lib/lzo: run-length encoding support

From: Dave Rodgman @ 2019-02-05 15:59 UTC
  To: linux-kernel, Matt Sealey, Dave Rodgman, davem, gregkh, herbert,
	markus, minchan, nitingupta910, rpurdie, sergey.senozhatsky.work,
	sonnyrao, akpm, sfr
  Cc: nd

Hi,

Following on from the previous lzo-rle patchset:

https://lkml.org/lkml/2018/11/30/972

This patchset contains only the RLE patches, and should be applied on top of
the non-RLE patches ( https://lkml.org/lkml/2019/2/5/366 ).

Previously, some questions were raised around the RLE patches. I've done some
additional benchmarking to answer these questions. In short:

 - RLE offers significant additional performance (data-dependent)
 - I didn't measure any regressions that were clearly outside the noise

The main concern with this patchset was performance: specifically, measuring
the impact of RLE separately from Matt Sealey's patches (CTZ & fast copy). The
results below are intended to clarify the benefit of each part of the
patchset.

Firstly, I've captured some memory via /dev/fmem from a Chromebook with many
tabs open which is starting to swap, and then split this into 4178 4k pages.
I've excluded the all-zero pages (as zram does), and also the no-zero pages
(which won't tell us anything about RLE performance). This should give a
realistic test dataset for zram. What I found was that the data is VERY
bimodal: 44% of pages in this dataset contain 5% or fewer zeros, and 44%
contain over 90% zeros (30% if you include the no-zero pages). This supports
the idea of special-casing zeros in zram.
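
As a rough illustration of the page filtering described above, here is a
minimal sketch in C. It assumes "zeros" means zero-valued bytes; the names
classify_page() and zero_percent() are hypothetical and are not taken from the
benchmark harness or from zram.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_BYTES 4096u	/* 4k pages, as in the dataset above */

enum page_class {
	PAGE_ALL_ZERO,	/* excluded: zram already special-cases these */
	PAGE_NO_ZERO,	/* excluded: tells us nothing about RLE */
	PAGE_MIXED	/* kept in the benchmark dataset */
};

/* Count the zero-valued bytes in a buffer. */
static size_t count_zero_bytes(const unsigned char *page, size_t len)
{
	size_t zeros = 0;

	for (size_t i = 0; i < len; i++)
		if (page[i] == 0)
			zeros++;
	return zeros;
}

/* Decide whether a page would be kept in the dataset (hypothetical helper). */
static enum page_class classify_page(const unsigned char *page, size_t len)
{
	size_t zeros;

	assert(len > 0);
	zeros = count_zero_bytes(page, len);
	if (zeros == len)
		return PAGE_ALL_ZERO;
	if (zeros == 0)
		return PAGE_NO_ZERO;
	return PAGE_MIXED;
}

/* Percentage of zero bytes in the page, rounded down. */
static unsigned int zero_percent(const unsigned char *page, size_t len)
{
	assert(len > 0);
	return (unsigned int)(count_zero_bytes(page, len) * 100 / len);
}
```

Bucketing the PAGE_MIXED pages by zero_percent() is one way to see the kind of
bimodal split reported above.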

Next, I've benchmarked four variants of lzo on these pages (on 64-bit Arm at
max frequency): baseline LZO; baseline + Matt Sealey's patches (aka MS);
baseline + RLE only; baseline + MS + RLE. Numbers are for weighted roundtrip
throughput (the weighting reflects that zram does more compression than
decompression).

https://drive.google.com/file/d/1VLtLjRVxgUNuWFOxaGPwJYhl_hMQXpHe/view?usp=sharing
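
The exact weighting is not spelled out above. As a sketch only, one plausible
way to combine compression and decompression throughput into a single
roundtrip figure is a weighted harmonic mean, where the weights say how often
each operation runs. The function name and both weight parameters are
assumptions for illustration, not the formula used for the graphs.

```c
#include <assert.h>

/*
 * Weighted roundtrip throughput (assumed formula, for illustration).
 *
 * comp_weight and decomp_weight are hypothetical relative frequencies of
 * compression and decompression. Time per byte is 1/throughput, so the
 * combined figure is total weight divided by total weighted time.
 */
static double weighted_roundtrip_mbps(double comp_mbps, double decomp_mbps,
				      double comp_weight, double decomp_weight)
{
	assert(comp_mbps > 0.0 && decomp_mbps > 0.0);
	assert(comp_weight >= 0.0 && decomp_weight >= 0.0);
	assert(comp_weight + decomp_weight > 0.0);

	return (comp_weight + decomp_weight) /
	       (comp_weight / comp_mbps + decomp_weight / decomp_mbps);
}
```

With this form, giving compression a larger weight pulls the result towards
the compression throughput, which matches the intent of reflecting that zram
compresses more often than it decompresses.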

Matt's patches help in all cases on Arm (and have no effect on Intel), as
expected.

RLE also behaves as expected: with few zeros present, it makes no difference;
above ~75%, it gives a good improvement (50 - 300 MB/s on top of the benefit
from Matt's patches).

Best performance is seen with both MS and RLE patches.

Finally, I have benchmarked the same dataset on an x86-64 device. Here, the
MS patches make no difference (as expected); RLE helps, much as it does on Arm.
There were no definite regressions; allowing for observational error, 0.1%
(3/4178) of cases had a regression > 1 standard deviation, of which the largest
was 4.6% (1.2 standard deviations). I think this is probably within the noise.

https://drive.google.com/file/d/1xCUVwmiGD0heEMx5gcVEmLBI4eLaageV/view?usp=sharing
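
For reference, a regression can be expressed both as a percentage and in
standard deviations along the following lines. This is a sketch under the
assumption that the standard deviation is that of repeated throughput
measurements for the same page; the function names are hypothetical. Under
that assumption, a 4.6% regression at 1.2 standard deviations implies a
standard deviation of roughly 3.8% of the baseline for that page.

```c
#include <assert.h>

/* Throughput lost relative to baseline, in percent (positive = slower). */
static double regression_percent(double base_mbps, double patched_mbps)
{
	assert(base_mbps > 0.0);
	return (base_mbps - patched_mbps) / base_mbps * 100.0;
}

/*
 * The same loss expressed in standard deviations. sd_mbps is assumed to be
 * the standard deviation of repeated measurements, in MB/s.
 */
static double regression_in_sd(double base_mbps, double patched_mbps,
			       double sd_mbps)
{
	assert(sd_mbps > 0.0);
	return (base_mbps - patched_mbps) / sd_mbps;
}
```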

One point to note is that the graphs show RLE appears to help very slightly
with no zeros present! This is because the extra code causes the clang
optimiser to change code layout in a way that happens to have a significant
benefit. Taking baseline LZO and adding a do-nothing line like
"__builtin_prefetch(out_len);" immediately before the "goto next" has the same
effect. So this is a real, but basically spurious effect - it's small enough
not to upset the overall findings.

Dave
