rust-for-linux.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC PATCH 00/11] Rust null block driver
@ 2023-05-03  9:06 Andreas Hindborg
  2023-05-03  9:06 ` [RFC PATCH 01/11] rust: add radix tree abstraction Andreas Hindborg
                   ` (16 more replies)
  0 siblings, 17 replies; 69+ messages in thread
From: Andreas Hindborg @ 2023-05-03  9:06 UTC (permalink / raw)
  To: Jens Axboe, Christoph Hellwig, Keith Busch, Damien Le Moal,
	Hannes Reinecke, lsf-pc, rust-for-linux, linux-block
  Cc: Andreas Hindborg, Matthew Wilcox, Miguel Ojeda, Alex Gaynor,
	Wedson Almeida Filho, Boqun Feng, Gary Guo, Björn Roy Baron,
	Benno Lossin, Andreas Hindborg, open list, gost.dev

From: Andreas Hindborg <a.hindborg@samsung.com>

Hi All!

This is an early preview of a null block driver written in Rust. It is a follow
up to my LSF topic proposal [1]. I send this series with intention of collecting
feedback and comments at LSF/MM/BPF.

The patches apply on top of the `rust-next` tree (rust-6.4
ea76e08f4d901a450619831a255e9e0a4c0ed162) [4].

A null block driver is a good opportunity to evaluate Rust bindings for the
block layer. It is a small and simple driver and thus should be simple to reason
about. Further, the null block driver is not usually deployed in production
environments. Thus, it should be fairly straight forward to review, and any
potential issues are not going to bring down any production workloads.

Being small and simple, the null block driver is a good place to introduce the
Linux kernel storage community to Rust. This will help prepare the community for
future Rust projects and facilitate a better maintenance process for these
projects.

The statistics presented in my previous message [1] show that the C null block
driver has had a significant amount of memory safety related problems in the
past. 41% of fixes merged for the C null block driver are fixes for memory
safety issues. This makes the null block driver a good candidate for rewriting
in Rust.

Patches in this series prefixed with capitalized RUST are code taken from the
R4L `rust` tree [2]. This code is not yet in mainline Linux and has not yet been
submitted as patches for inclusion in Linux. I am not the original author of
these patches, except for small changes required to include them in this series.

The driver performs similarly to the C driver for the implemented features, see
table below [5].

In this table each cell shows the relative performance of the Rust
driver to the C driver with the throughput of the C driver in parenthesis:
`rel_read rel_write (read_miops write_miops)`. Ex: the Rust driver performs 4.74
percent better than the C driver for fio randread with 2 jobs at 16 KiB
block size.

Over the 432 benchmark configurations, the relative performance of the Rust
driver to the C driver (in terms of IOPS) is between 6.8 and -11.8 percent with
an average of 0.2 percent better for reads. For writes the span is 16.8 to -4.5
percent with an average of 0.9 percent better.

For each measurement the drivers are loaded, a drive is configured with memory
backing and a size of 4 GiB. C null_blk is configured to match the implemented
modes of the Rust driver: `blocksize` is set to 4 KiB, `completion_nsec` to 0,
`irqmode` to 0 (IRQ_NONE), `queue_mode` to 2 (MQ), `hw_queue_depth` to 256 and
`memory_backed` to 1. For both the drivers, the queue scheduler is set to
`none`. These measurements are made using 30 second runs of `fio` with the
`PSYNC` IO engine with workers pinned to separate CPU cores. The measurements
are done inside a virtual machine (qemu/kvm) on an Intel Alder Lake workstation
(i5-12600).

Given the relatively large spread in these numbers, it is possible that there is
quite a bit of noise in the results. If anybody is able to dedicate some CPU
time to run benchmarks in a more controlled and standardized environment, I
would appreciate it.

The feature list of the driver is relatively short for now:

 - `blk-mq` support
 - Direct completion of IO
 - Read and write requests
 - Optional memory backing

Features available in the C null_blk driver that are currently not implemented
in this work:

 - Bio-based submission
 - NUMA support
 - Block size configuration
 - Multiple devices
 - Dynamic device creation/destruction
 - Soft-IRQ and timer mode
 - Queue depth configuration
 - Queue count configuration
 - Discard operation support
 - Cache emulation
 - Bandwidth throttling
 - Per node hctx
 - IO scheduler configuration
 - Blocking submission mode
 - Shared tags configuration (for >1 device)
 - Zoned storage support
 - Bad block simulation
 - Poll queues

The driver is implemented entirely in safe Rust, with all unsafe code fully
contained in wrappers for C APIs.

Patches 07/11 and 08/11 that enable the equivalent of the C
`spin_lock_irqsave()` did appear on the Rust list [6,3] but was not applied to
`rust-next` tree yet [4]. I am not the author of these patches, but I include
them here for convenience. Please defer any discussion of these patches to their
original posts.

The null block driver is provided in patch 10/11. Patches 03/11 and 04/11
introduce Rust abstractions for the block C APIs required by the driver.


[1] https://lore.kernel.org/all/87y1ofj5tt.fsf@metaspace.dk/
[2] https://github.com/Rust-for-Linux/linux/tree/rust
[3] https://lore.kernel.org/all/20230411054543.21278-7-wedsonaf@gmail.com/
[4] https://github.com/Rust-for-Linux/linux/tree/rust-next
[6] https://lore.kernel.org/all/20230411054543.21278-6-wedsonaf@gmail.com/

[5]:
+---------+-----------+-------------------------------+-----------------------------+-------------------------------+
| jobs/bs |  workload |               1               |              2              |               3               |
+---------+-----------+-------------------------------+-----------------------------+-------------------------------+
|    4k   |  randread |      1.54 0.00 (0.7,0.0)      |     0.35 0.00 (1.3,0.0)     |      -0.52 0.00 (1.3,0.0)     |
|    8k   |  randread |      3.30 0.00 (0.57,0.0)     |     6.79 0.00 (0.92,0.0)    |      0.24 0.00 (0.97,0.0)     |
|   16k   |  randread |      0.89 0.00 (0.42,0.0)     |     4.74 0.00 (0.64,0.0)    |      3.98 0.00 (0.65,0.0)     |
|   32k   |  randread |      2.04 0.00 (0.27,0.0)     |     3.26 0.00 (0.37,0.0)    |      1.41 0.00 (0.37,0.0)     |
|   64k   |  randread |      1.35 0.00 (0.16,0.0)     |     0.71 0.00 (0.21,0.0)    |      0.66 0.00 (0.22,0.0)     |
|   128k  |  randread |      0.59 0.00 (0.09,0.0)     |     0.13 0.00 (0.1,0.0)     |     -6.47 0.00 (0.11,0.0)     |
|   256k  |  randread |     0.13 0.00 (0.048,0.0)     |    1.00 0.00 (0.055,0.0)    |     0.03 0.00 (0.059,0.0)     |
|   512k  |  randread |     1.16 0.00 (0.023,0.0)     |    -1.82 0.00 (0.027,0.0)   |     -0.90 0.00 (0.028,0.0)    |
|  1024k  |  randread |     0.14 0.00 (0.011,0.0)     |    0.58 0.00 (0.012,0.0)    |     -1.14 0.00 (0.013,0.0)    |
|  2048k  |  randread |     2.02 0.00 (0.005,0.0)     |    0.13 0.00 (0.0062,0.0)   |    -0.37 0.00 (0.0062,0.0)    |
|  4096k  |  randread |     1.21 0.00 (0.0026,0.0)    |   -5.68 0.00 (0.0031,0.0)   |    -2.39 0.00 (0.0029,0.0)    |
|  8192k  |  randread |    -1.02 0.00 (0.0013,0.0)    |    0.66 0.00 (0.0013,0.0)   |    -1.72 0.00 (0.0013,0.0)    |
|    4k   |   randrw  |     2.82 2.82 (0.25,0.25)     |     2.94 2.94 (0.3,0.3)     |     3.19 3.20 (0.34,0.34)     |
|    8k   |   randrw  |     2.80 2.82 (0.21,0.21)     |    2.32 2.32 (0.25,0.25)    |     1.95 1.96 (0.28,0.28)     |
|   16k   |   randrw  |     4.54 4.52 (0.16,0.16)     |     2.29 2.29 (0.2,0.2)     |     2.27 2.30 (0.21,0.21)     |
|   32k   |   randrw  |     3.51 3.51 (0.11,0.11)     |    2.43 2.40 (0.13,0.13)    |     1.38 1.38 (0.14,0.14)     |
|   64k   |   randrw  |    2.40 2.42 (0.069,0.069)    |    1.98 1.99 (0.08,0.08)    |    0.51 0.51 (0.082,0.082)    |
|   128k  |   randrw  |    1.50 1.49 (0.039,0.039)    |   1.28 1.27 (0.042,0.042)   |    0.83 0.84 (0.043,0.043)    |
|   256k  |   randrw  |    2.02 2.06 (0.021,0.021)    |   0.92 0.91 (0.022,0.022)   |    0.70 0.75 (0.022,0.022)    |
|   512k  |   randrw  |    0.82 0.82 (0.011,0.011)    |   0.69 0.70 (0.011,0.011)   |    0.97 0.98 (0.011,0.011)    |
|  1024k  |   randrw  |   3.78 3.84 (0.0046,0.0046)   |  0.98 0.98 (0.0053,0.0053)  |   1.30 1.29 (0.0054,0.0054)   |
|  2048k  |   randrw  |   1.60 1.54 (0.0023,0.0023)   |  1.76 1.69 (0.0026,0.0026)  |   1.23 0.92 (0.0026,0.0026)   |
|  4096k  |   randrw  |   1.18 1.10 (0.0011,0.0012)   |  1.32 1.20 (0.0014,0.0014)  |   1.49 1.41 (0.0013,0.0013)   |
|  8192k  |   randrw  | -2.01 -1.89 (0.00057,0.00057) | 1.20 1.14 (0.00061,0.00061) |  -0.74 -0.61 (0.0006,0.00062) |
|    4k   | randwrite |      0.00 2.48 (0.0,0.41)     |     0.00 3.61 (0.0,0.39)    |      0.00 3.93 (0.0,0.44)     |
|    8k   | randwrite |      0.00 3.46 (0.0,0.35)     |     0.00 1.90 (0.0,0.34)    |      0.00 2.20 (0.0,0.37)     |
|   16k   | randwrite |      0.00 3.01 (0.0,0.28)     |     0.00 2.86 (0.0,0.27)    |      0.00 2.25 (0.0,0.29)     |
|   32k   | randwrite |      0.00 3.02 (0.0,0.19)     |     0.00 2.68 (0.0,0.19)    |      0.00 2.20 (0.0,0.21)     |
|   64k   | randwrite |      0.00 3.39 (0.0,0.12)     |     0.00 3.75 (0.0,0.12)    |     0.00 -0.79 (0.0,0.14)     |
|   128k  | randwrite |     0.00 3.81 (0.0,0.069)     |    0.00 1.10 (0.0,0.072)    |     0.00 5.48 (0.0,0.078)     |
|   256k  | randwrite |     0.00 4.28 (0.0,0.038)     |    0.00 1.27 (0.0,0.037)    |     0.00 -0.10 (0.0,0.041)    |
|   512k  | randwrite |     0.00 2.71 (0.0,0.019)     |    0.00 0.46 (0.0,0.017)    |     0.00 2.96 (0.0,0.019)     |
|  1024k  | randwrite |     0.00 2.67 (0.0,0.0082)    |    0.00 2.54 (0.0,0.0087)   |     0.00 0.20 (0.0,0.0093)    |
|  2048k  | randwrite |     0.00 2.40 (0.0,0.0041)    |    0.00 1.90 (0.0,0.0041)   |     0.00 3.44 (0.0,0.0043)    |
|  4096k  | randwrite |     0.00 3.38 (0.0,0.002)     |    0.00 1.69 (0.0,0.0025)   |     0.00 6.00 (0.0,0.0025)    |
|  8192k  | randwrite |    0.00 3.09 (0.0,0.00098)    |    0.00 1.10 (0.0,0.0012)   |     0.00 0.45 (0.0,0.0012)    |
|    4k   |    read   |      1.26 0.00 (1.0,0.0)      |     1.65 0.00 (2.0,0.0)     |      -9.94 0.00 (2.8,0.0)     |
|    8k   |    read   |      1.86 0.00 (0.78,0.0)     |     4.60 0.00 (1.5,0.0)     |      0.47 0.00 (1.7,0.0)      |
|   16k   |    read   |      1.13 0.00 (0.54,0.0)     |     4.00 0.00 (0.95,0.0)    |     -10.28 0.00 (0.97,0.0)    |
|   32k   |    read   |      0.64 0.00 (0.32,0.0)     |     5.54 0.00 (0.46,0.0)    |     -0.67 0.00 (0.47,0.0)     |
|   64k   |    read   |      0.29 0.00 (0.18,0.0)     |    -0.08 0.00 (0.25,0.0)    |     -0.82 0.00 (0.25,0.0)     |
|   128k  |    read   |     0.11 0.00 (0.095,0.0)     |     0.12 0.00 (0.11,0.0)    |     -4.99 0.00 (0.12,0.0)     |
|   256k  |    read   |      1.66 0.00 (0.05,0.0)     |    -0.41 0.00 (0.058,0.0)   |     -0.30 0.00 (0.061,0.0)    |
|   512k  |    read   |     1.06 0.00 (0.024,0.0)     |    -1.77 0.00 (0.029,0.0)   |     0.81 0.00 (0.029,0.0)     |
|  1024k  |    read   |     0.41 0.00 (0.011,0.0)     |    -2.28 0.00 (0.013,0.0)   |     -0.33 0.00 (0.013,0.0)    |
|  2048k  |    read   |    -1.18 0.00 (0.0053,0.0)    |    1.41 0.00 (0.0063,0.0)   |    -0.86 0.00 (0.0063,0.0)    |
|  4096k  |    read   |    -0.54 0.00 (0.0027,0.0)    |   -11.77 0.00 (0.0031,0.0)  |     0.02 0.00 (0.003,0.0)     |
|  8192k  |    read   |    -1.60 0.00 (0.0013,0.0)    |   -4.25 0.00 (0.0013,0.0)   |    -1.64 0.00 (0.0013,0.0)    |
|    4k   | readwrite |     1.01 1.01 (0.34,0.34)     |    1.00 1.00 (0.42,0.42)    |     1.30 1.29 (0.45,0.44)     |
|    8k   | readwrite |     1.54 1.54 (0.28,0.28)     |    1.65 1.66 (0.32,0.32)    |     1.18 1.18 (0.36,0.36)     |
|   16k   | readwrite |      1.07 1.08 (0.2,0.2)      |    2.82 2.81 (0.24,0.24)    |     0.40 0.39 (0.26,0.26)     |
|   32k   | readwrite |     0.56 0.56 (0.13,0.13)     |   -1.31 -1.30 (0.16,0.16)   |     0.09 0.09 (0.16,0.16)     |
|   64k   | readwrite |    1.82 1.83 (0.077,0.077)    |  -0.96 -0.96 (0.091,0.091)  |   -0.99 -0.97 (0.092,0.092)   |
|   128k  | readwrite |    2.09 2.10 (0.041,0.041)    |   1.07 1.06 (0.045,0.045)   |    0.28 0.29 (0.045,0.045)    |
|   256k  | readwrite |    2.06 2.05 (0.022,0.022)    |   1.84 1.85 (0.023,0.023)   |    0.66 0.59 (0.023,0.023)    |
|   512k  | readwrite |     5.90 5.84 (0.01,0.01)     |   1.20 1.17 (0.011,0.011)   |    1.18 1.21 (0.011,0.011)    |
|  1024k  | readwrite |   1.94 1.89 (0.0047,0.0047)   |  2.43 2.48 (0.0053,0.0053)  |   2.18 2.21 (0.0054,0.0054)   |
|  2048k  | readwrite |  -0.47 -0.47 (0.0023,0.0023)  |  2.45 2.38 (0.0026,0.0026)  |   1.76 1.78 (0.0026,0.0026)   |
|  4096k  | readwrite |   1.88 1.77 (0.0011,0.0012)   |  1.37 1.17 (0.0014,0.0014)  |   1.80 1.90 (0.0013,0.0013)   |
|  8192k  | readwrite |  1.27 1.42 (0.00057,0.00057)  | 0.86 0.73 (0.00061,0.00062) | -0.99 -1.28 (0.00061,0.00062) |
|    4k   |   write   |      0.00 1.53 (0.0,0.51)     |     0.00 0.35 (0.0,0.51)    |      0.00 0.92 (0.0,0.52)     |
|    8k   |   write   |      0.00 1.04 (0.0,0.42)     |     0.00 1.01 (0.0,0.4)     |      0.00 2.90 (0.0,0.44)     |
|   16k   |   write   |      0.00 0.86 (0.0,0.32)     |     0.00 1.89 (0.0,0.35)    |     0.00 -1.38 (0.0,0.36)     |
|   32k   |   write   |      0.00 0.47 (0.0,0.21)     |    0.00 -0.52 (0.0,0.26)    |     0.00 -1.40 (0.0,0.24)     |
|   64k   |   write   |      0.00 0.30 (0.0,0.13)     |    0.00 13.34 (0.0,0.14)    |      0.00 2.01 (0.0,0.15)     |
|   128k  |   write   |     0.00 2.24 (0.0,0.073)     |    0.00 16.77 (0.0,0.076)   |     0.00 -0.36 (0.0,0.085)    |
|   256k  |   write   |     0.00 3.49 (0.0,0.039)     |    0.00 9.76 (0.0,0.039)    |     0.00 -4.52 (0.0,0.043)    |
|   512k  |   write   |      0.00 4.08 (0.0,0.02)     |    0.00 1.03 (0.0,0.017)    |     0.00 1.05 (0.0,0.019)     |
|  1024k  |   write   |     0.00 2.72 (0.0,0.0083)    |    0.00 2.89 (0.0,0.0086)   |     0.00 2.98 (0.0,0.0095)    |
|  2048k  |   write   |     0.00 2.14 (0.0,0.0041)    |    0.00 3.12 (0.0,0.0041)   |     0.00 1.84 (0.0,0.0044)    |
|  4096k  |   write   |     0.00 1.76 (0.0,0.002)     |    0.00 1.80 (0.0,0.0026)   |     0.00 6.01 (0.0,0.0025)    |
|  8192k  |   write   |    0.00 1.41 (0.0,0.00099)    |    0.00 0.98 (0.0,0.0012)   |     0.00 0.13 (0.0,0.0012)    |
+---------+-----------+-------------------------------+-----------------------------+-------------------------------+

+---------+-----------+-------------------------------+------------------------------+-------------------------------+
| jobs/bs |  workload |               4               |              5               |               6               |
+---------+-----------+-------------------------------+------------------------------+-------------------------------+
|    4k   |  randread |      1.66 0.00 (1.3,0.0)      |     -0.81 0.00 (1.3,0.0)     |      -0.50 0.00 (1.2,0.0)     |
|    8k   |  randread |      2.52 0.00 (0.98,0.0)     |     2.54 0.00 (0.95,0.0)     |      2.52 0.00 (0.92,0.0)     |
|   16k   |  randread |      0.68 0.00 (0.65,0.0)     |     3.43 0.00 (0.64,0.0)     |      5.84 0.00 (0.63,0.0)     |
|   32k   |  randread |      1.88 0.00 (0.38,0.0)     |     0.46 0.00 (0.38,0.0)     |      2.33 0.00 (0.37,0.0)     |
|   64k   |  randread |      1.28 0.00 (0.21,0.0)     |     0.64 0.00 (0.21,0.0)     |     -0.26 0.00 (0.21,0.0)     |
|   128k  |  randread |     -0.05 0.00 (0.11,0.0)     |     0.10 0.00 (0.11,0.0)     |      0.09 0.00 (0.11,0.0)     |
|   256k  |  randread |     -0.01 0.00 (0.059,0.0)    |    -1.01 0.00 (0.059,0.0)    |     -0.39 0.00 (0.059,0.0)    |
|   512k  |  randread |     -1.17 0.00 (0.029,0.0)    |    0.39 0.00 (0.028,0.0)     |     -0.33 0.00 (0.028,0.0)    |
|  1024k  |  randread |     -0.20 0.00 (0.013,0.0)    |    -0.85 0.00 (0.013,0.0)    |     -0.57 0.00 (0.013,0.0)    |
|  2048k  |  randread |    -0.07 0.00 (0.0062,0.0)    |   -0.26 0.00 (0.0061,0.0)    |     -0.15 0.00 (0.006,0.0)    |
|  4096k  |  randread |    -0.25 0.00 (0.0027,0.0)    |   -2.70 0.00 (0.0027,0.0)    |    -2.35 0.00 (0.0026,0.0)    |
|  8192k  |  randread |    -2.94 0.00 (0.0013,0.0)    |   -3.01 0.00 (0.0012,0.0)    |    -2.09 0.00 (0.0012,0.0)    |
|    4k   |   randrw  |     2.83 2.84 (0.36,0.36)     |    2.24 2.23 (0.37,0.37)     |     3.49 3.50 (0.38,0.38)     |
|    8k   |   randrw  |     1.31 1.32 (0.29,0.29)     |     2.01 2.01 (0.3,0.3)      |      2.09 2.09 (0.3,0.3)      |
|   16k   |   randrw  |     1.96 1.95 (0.22,0.22)     |    1.61 1.61 (0.22,0.22)     |     1.99 2.00 (0.22,0.22)     |
|   32k   |   randrw  |     1.01 1.03 (0.14,0.14)     |    0.58 0.59 (0.14,0.14)     |    -0.04 -0.05 (0.14,0.14)    |
|   64k   |   randrw  |   -0.19 -0.18 (0.081,0.081)   |  -0.72 -0.72 (0.081,0.081)   |   -0.50 -0.50 (0.081,0.081)   |
|   128k  |   randrw  |    0.42 0.42 (0.043,0.043)    |  -0.33 -0.36 (0.043,0.043)   |   -0.62 -0.59 (0.043,0.043)   |
|   256k  |   randrw  |    0.38 0.41 (0.022,0.022)    |  -0.35 -0.35 (0.022,0.022)   |    0.37 0.31 (0.022,0.022)    |
|   512k  |   randrw  |    0.40 0.41 (0.011,0.011)    |  -0.29 -0.26 (0.011,0.011)   |     2.76 2.48 (0.01,0.01)     |
|  1024k  |   randrw  |   1.16 1.01 (0.0052,0.0052)   |  0.74 0.72 (0.0051,0.0051)   |    1.72 1.77 (0.005,0.005)    |
|  2048k  |   randrw  |   1.09 0.96 (0.0026,0.0026)   |  1.19 1.57 (0.0026,0.0026)   |   1.49 1.21 (0.0025,0.0025)   |
|  4096k  |   randrw  |   -0.13 0.06 (0.0013,0.0013)  |  0.69 0.77 (0.0013,0.0013)   |  -0.20 -0.13 (0.0012,0.0012)  |
|  8192k  |   randrw  | -0.71 -1.27 (0.00059,0.00061) | -1.78 -2.02 (0.00059,0.0006) | -1.60 -1.25 (0.00058,0.00059) |
|    4k   | randwrite |      0.00 2.17 (0.0,0.44)     |     0.00 4.08 (0.0,0.45)     |      0.00 5.47 (0.0,0.46)     |
|    8k   | randwrite |      0.00 0.98 (0.0,0.38)     |     0.00 2.39 (0.0,0.39)     |      0.00 2.30 (0.0,0.4)      |
|   16k   | randwrite |      0.00 2.02 (0.0,0.3)      |     0.00 2.63 (0.0,0.3)      |      0.00 3.17 (0.0,0.3)      |
|   32k   | randwrite |      0.00 1.82 (0.0,0.21)     |     0.00 1.12 (0.0,0.22)     |      0.00 1.40 (0.0,0.21)     |
|   64k   | randwrite |      0.00 1.41 (0.0,0.14)     |     0.00 1.92 (0.0,0.13)     |      0.00 1.76 (0.0,0.13)     |
|   128k  | randwrite |     0.00 2.32 (0.0,0.075)     |    0.00 4.61 (0.0,0.074)     |     0.00 2.31 (0.0,0.074)     |
|   256k  | randwrite |     0.00 -0.77 (0.0,0.038)    |    0.00 2.61 (0.0,0.036)     |     0.00 2.48 (0.0,0.036)     |
|   512k  | randwrite |     0.00 1.83 (0.0,0.019)     |    0.00 1.68 (0.0,0.018)     |     0.00 2.80 (0.0,0.018)     |
|  1024k  | randwrite |     0.00 3.32 (0.0,0.0086)    |    0.00 1.20 (0.0,0.0087)    |     0.00 0.92 (0.0,0.0081)    |
|  2048k  | randwrite |     0.00 0.92 (0.0,0.0043)    |    0.00 2.95 (0.0,0.0042)    |     0.00 3.15 (0.0,0.0042)    |
|  4096k  | randwrite |     0.00 3.22 (0.0,0.0025)    |    0.00 0.50 (0.0,0.0024)    |     0.00 0.40 (0.0,0.0024)    |
|  8192k  | randwrite |    0.00 -0.57 (0.0,0.0012)    |   0.00 -0.41 (0.0,0.0011)    |    0.00 -0.36 (0.0,0.0011)    |
|    4k   |    read   |      4.34 0.00 (2.5,0.0)      |     -0.40 0.00 (2.6,0.0)     |      -8.45 0.00 (2.6,0.0)     |
|    8k   |    read   |      -1.43 0.00 (1.7,0.0)     |     0.01 0.00 (1.7,0.0)      |      -3.89 0.00 (1.7,0.0)     |
|   16k   |    read   |      1.68 0.00 (0.96,0.0)     |    -0.38 0.00 (0.94,0.0)     |      0.47 0.00 (0.91,0.0)     |
|   32k   |    read   |     -0.98 0.00 (0.48,0.0)     |    -0.09 0.00 (0.47,0.0)     |     -1.24 0.00 (0.47,0.0)     |
|   64k   |    read   |     -0.42 0.00 (0.25,0.0)     |    -0.26 0.00 (0.25,0.0)     |     -0.94 0.00 (0.25,0.0)     |
|   128k  |    read   |     -0.49 0.00 (0.12,0.0)     |     0.06 0.00 (0.12,0.0)     |     -0.59 0.00 (0.12,0.0)     |
|   256k  |    read   |     -0.89 0.00 (0.062,0.0)    |    -0.11 0.00 (0.062,0.0)    |     -0.77 0.00 (0.062,0.0)    |
|   512k  |    read   |     -0.64 0.00 (0.03,0.0)     |    -0.11 0.00 (0.03,0.0)     |      0.08 0.00 (0.03,0.0)     |
|  1024k  |    read   |     0.00 0.00 (0.013,0.0)     |    -0.05 0.00 (0.013,0.0)    |     0.42 0.00 (0.013,0.0)     |
|  2048k  |    read   |    -0.69 0.00 (0.0063,0.0)    |   -0.36 0.00 (0.0063,0.0)    |     1.12 0.00 (0.006,0.0)     |
|  4096k  |    read   |    -2.69 0.00 (0.0029,0.0)    |   -1.85 0.00 (0.0027,0.0)    |    -1.13 0.00 (0.0026,0.0)    |
|  8192k  |    read   |    -2.15 0.00 (0.0013,0.0)    |   -1.66 0.00 (0.0013,0.0)    |    -2.90 0.00 (0.0012,0.0)    |
|    4k   | readwrite |     0.32 0.32 (0.46,0.46)     |    0.68 0.69 (0.47,0.47)     |     -0.26 -0.26 (0.5,0.5)     |
|    8k   | readwrite |     0.18 0.18 (0.37,0.37)     |    0.77 0.76 (0.38,0.38)     |    -0.47 -0.48 (0.39,0.39)    |
|   16k   | readwrite |     0.24 0.25 (0.27,0.27)     |   -0.17 -0.15 (0.27,0.27)    |    -0.51 -0.52 (0.27,0.27)    |
|   32k   | readwrite |    -0.91 -0.91 (0.17,0.17)    |   -1.40 -1.37 (0.17,0.17)    |    -1.67 -1.70 (0.17,0.17)    |
|   64k   | readwrite |    -1.22 -1.21 (0.09,0.09)    |   -1.33 -1.30 (0.09,0.09)    |   -2.49 -2.50 (0.091,0.091)   |
|   128k  | readwrite |   -0.67 -0.69 (0.045,0.045)   |  -1.23 -1.24 (0.045,0.045)   |   -1.01 -0.96 (0.045,0.045)   |
|   256k  | readwrite |   -0.31 -0.35 (0.023,0.023)   |  -0.28 -0.28 (0.023,0.023)   |    0.25 0.08 (0.022,0.023)    |
|   512k  | readwrite |    0.77 0.72 (0.011,0.011)    |   0.96 0.96 (0.011,0.011)    |    0.62 1.16 (0.011,0.011)    |
|  1024k  | readwrite |   0.93 0.91 (0.0053,0.0053)   |  0.72 0.78 (0.0051,0.0051)   |  -0.48 -0.10 (0.0051,0.0051)  |
|  2048k  | readwrite |   1.84 1.77 (0.0026,0.0026)   |  0.86 0.71 (0.0026,0.0026)   |   0.31 1.37 (0.0026,0.0026)   |
|  4096k  | readwrite |   2.71 2.43 (0.0013,0.0013)   |  1.54 1.73 (0.0013,0.0013)   |  -1.18 -0.79 (0.0012,0.0013)  |
|  8192k  | readwrite |  -2.61 -1.52 (0.0006,0.00061) | -1.83 -1.27 (0.00059,0.0006) | -2.25 -1.53 (0.00059,0.00059) |
|    4k   |   write   |     0.00 -2.94 (0.0,0.53)     |    0.00 -2.23 (0.0,0.57)     |      0.00 0.93 (0.0,0.56)     |
|    8k   |   write   |     0.00 -1.78 (0.0,0.45)     |    0.00 -1.14 (0.0,0.47)     |      0.00 0.47 (0.0,0.47)     |
|   16k   |   write   |     0.00 -1.46 (0.0,0.35)     |    0.00 -1.46 (0.0,0.36)     |      0.00 0.25 (0.0,0.36)     |
|   32k   |   write   |     0.00 -1.05 (0.0,0.24)     |    0.00 -1.53 (0.0,0.24)     |     0.00 -0.74 (0.0,0.24)     |
|   64k   |   write   |     0.00 -2.78 (0.0,0.15)     |     0.00 1.39 (0.0,0.15)     |     0.00 -2.42 (0.0,0.15)     |
|   128k  |   write   |     0.00 -0.03 (0.0,0.082)    |    0.00 0.32 (0.0,0.081)     |     0.00 2.76 (0.0,0.079)     |
|   256k  |   write   |     0.00 -0.33 (0.0,0.039)    |    0.00 -0.32 (0.0,0.038)    |     0.00 -2.37 (0.0,0.038)    |
|   512k  |   write   |     0.00 4.00 (0.0,0.019)     |     0.00 6.45 (0.0,0.02)     |     0.00 2.94 (0.0,0.019)     |
|  1024k  |   write   |     0.00 1.22 (0.0,0.0088)    |    0.00 1.57 (0.0,0.0088)    |    0.00 -3.98 (0.0,0.0087)    |
|  2048k  |   write   |     0.00 0.05 (0.0,0.0043)    |    0.00 3.33 (0.0,0.0044)    |     0.00 8.69 (0.0,0.0044)    |
|  4096k  |   write   |     0.00 2.36 (0.0,0.0025)    |    0.00 1.52 (0.0,0.0024)    |     0.00 0.44 (0.0,0.0024)    |
|  8192k  |   write   |    0.00 -0.35 (0.0,0.0012)    |   0.00 -0.45 (0.0,0.0011)    |    0.00 -0.73 (0.0,0.0011)    |
+---------+-----------+-------------------------------+------------------------------+-------------------------------+


Andreas Hindborg (9):
  rust: add radix tree abstraction
  rust: add `pages` module for handling page allocation
  rust: block: introduce `kernel::block::mq` module
  rust: block: introduce `kernel::block::bio` module
  RUST: add `module_params` macro
  rust: apply cache line padding for `SpinLock`
  RUST: implement `ForeignOwnable` for `Pin`
  rust: add null block driver
  rust: inline a number of short functions

Wedson Almeida Filho (2):
  rust: lock: add support for `Lock::lock_irqsave`
  rust: lock: implement `IrqSaveBackend` for `SpinLock`

 drivers/block/Kconfig              |   4 +
 drivers/block/Makefile             |   4 +
 drivers/block/rnull-helpers.c      |  60 ++++
 drivers/block/rnull.rs             | 177 ++++++++++
 rust/bindings/bindings_helper.h    |   4 +
 rust/bindings/lib.rs               |   1 +
 rust/helpers.c                     |  91 ++++++
 rust/kernel/block.rs               |   6 +
 rust/kernel/block/bio.rs           |  93 ++++++
 rust/kernel/block/bio/vec.rs       | 181 +++++++++++
 rust/kernel/block/mq.rs            |  15 +
 rust/kernel/block/mq/gen_disk.rs   | 133 ++++++++
 rust/kernel/block/mq/operations.rs | 260 +++++++++++++++
 rust/kernel/block/mq/raw_writer.rs |  30 ++
 rust/kernel/block/mq/request.rs    |  87 +++++
 rust/kernel/block/mq/tag_set.rs    |  92 ++++++
 rust/kernel/cache_padded.rs        |  33 ++
 rust/kernel/error.rs               |   4 +
 rust/kernel/lib.rs                 |  13 +
 rust/kernel/module_param.rs        | 501 +++++++++++++++++++++++++++++
 rust/kernel/pages.rs               | 284 ++++++++++++++++
 rust/kernel/radix_tree.rs          | 156 +++++++++
 rust/kernel/sync/lock.rs           |  49 ++-
 rust/kernel/sync/lock/mutex.rs     |   2 +
 rust/kernel/sync/lock/spinlock.rs  |  50 ++-
 rust/kernel/types.rs               |  30 ++
 rust/macros/helpers.rs             |  46 ++-
 rust/macros/module.rs              | 402 +++++++++++++++++++++--
 scripts/Makefile.build             |   2 +-
 29 files changed, 2766 insertions(+), 44 deletions(-)
 create mode 100644 drivers/block/rnull-helpers.c
 create mode 100644 drivers/block/rnull.rs
 create mode 100644 rust/kernel/block.rs
 create mode 100644 rust/kernel/block/bio.rs
 create mode 100644 rust/kernel/block/bio/vec.rs
 create mode 100644 rust/kernel/block/mq.rs
 create mode 100644 rust/kernel/block/mq/gen_disk.rs
 create mode 100644 rust/kernel/block/mq/operations.rs
 create mode 100644 rust/kernel/block/mq/raw_writer.rs
 create mode 100644 rust/kernel/block/mq/request.rs
 create mode 100644 rust/kernel/block/mq/tag_set.rs
 create mode 100644 rust/kernel/cache_padded.rs
 create mode 100644 rust/kernel/module_param.rs
 create mode 100644 rust/kernel/pages.rs
 create mode 100644 rust/kernel/radix_tree.rs


base-commit: ea76e08f4d901a450619831a255e9e0a4c0ed162
-- 
2.40.0


^ permalink raw reply	[flat|nested] 69+ messages in thread

* [RFC PATCH 01/11] rust: add radix tree abstraction
  2023-05-03  9:06 [RFC PATCH 00/11] Rust null block driver Andreas Hindborg
@ 2023-05-03  9:06 ` Andreas Hindborg
  2023-05-03 10:34   ` Benno Lossin
  2023-05-05  4:04   ` Matthew Wilcox
  2023-05-03  9:06 ` [RFC PATCH 02/11] rust: add `pages` module for handling page allocation Andreas Hindborg
                   ` (15 subsequent siblings)
  16 siblings, 2 replies; 69+ messages in thread
From: Andreas Hindborg @ 2023-05-03  9:06 UTC (permalink / raw)
  To: Jens Axboe, Christoph Hellwig, Keith Busch, Damien Le Moal,
	Hannes Reinecke, lsf-pc, rust-for-linux, linux-block
  Cc: Andreas Hindborg, Matthew Wilcox, Miguel Ojeda, Alex Gaynor,
	Wedson Almeida Filho, Boqun Feng, Gary Guo, Björn Roy Baron,
	Benno Lossin, Andreas Hindborg, open list, gost.dev

From: Andreas Hindborg <a.hindborg@samsung.com>

Add abstractions for the C radix_tree. This abstraction allows Rust code to use
the radix_tree as a map from `u64` keys to `ForeignOwnable` values.

Signed-off-by: Andreas Hindborg <a.hindborg@samsung.com>
---
 rust/bindings/bindings_helper.h |   2 +
 rust/bindings/lib.rs            |   1 +
 rust/helpers.c                  |  22 +++++
 rust/kernel/lib.rs              |   1 +
 rust/kernel/radix_tree.rs       | 156 ++++++++++++++++++++++++++++++++
 5 files changed, 182 insertions(+)
 create mode 100644 rust/kernel/radix_tree.rs

diff --git a/rust/bindings/bindings_helper.h b/rust/bindings/bindings_helper.h
index 50e7a76d5455..52834962b94d 100644
--- a/rust/bindings/bindings_helper.h
+++ b/rust/bindings/bindings_helper.h
@@ -10,7 +10,9 @@
 #include <linux/refcount.h>
 #include <linux/wait.h>
 #include <linux/sched.h>
+#include <linux/radix-tree.h>
 
 /* `bindgen` gets confused at certain things. */
 const gfp_t BINDINGS_GFP_KERNEL = GFP_KERNEL;
+const gfp_t BINDINGS_GFP_ATOMIC = GFP_ATOMIC;
 const gfp_t BINDINGS___GFP_ZERO = __GFP_ZERO;
diff --git a/rust/bindings/lib.rs b/rust/bindings/lib.rs
index 7b246454e009..62f36a9eb1f4 100644
--- a/rust/bindings/lib.rs
+++ b/rust/bindings/lib.rs
@@ -51,4 +51,5 @@ mod bindings_helper {
 pub use bindings_raw::*;
 
 pub const GFP_KERNEL: gfp_t = BINDINGS_GFP_KERNEL;
+pub const GFP_ATOMIC: gfp_t = BINDINGS_GFP_ATOMIC;
 pub const __GFP_ZERO: gfp_t = BINDINGS___GFP_ZERO;
diff --git a/rust/helpers.c b/rust/helpers.c
index 81e80261d597..5dd5e325b7cc 100644
--- a/rust/helpers.c
+++ b/rust/helpers.c
@@ -26,6 +26,7 @@
 #include <linux/spinlock.h>
 #include <linux/sched/signal.h>
 #include <linux/wait.h>
+#include <linux/radix-tree.h>
 
 __noreturn void rust_helper_BUG(void)
 {
@@ -128,6 +129,27 @@ void rust_helper_put_task_struct(struct task_struct *t)
 }
 EXPORT_SYMBOL_GPL(rust_helper_put_task_struct);
 
+void rust_helper_init_radix_tree(struct xarray *tree, gfp_t gfp_mask)
+{
+	INIT_RADIX_TREE(tree, gfp_mask);
+}
+EXPORT_SYMBOL_GPL(rust_helper_init_radix_tree);
+
+void **rust_helper_radix_tree_iter_init(struct radix_tree_iter *iter,
+					unsigned long start)
+{
+	return radix_tree_iter_init(iter, start);
+}
+EXPORT_SYMBOL_GPL(rust_helper_radix_tree_iter_init);
+
+void **rust_helper_radix_tree_next_slot(void **slot,
+					struct radix_tree_iter *iter,
+					unsigned flags)
+{
+	return radix_tree_next_slot(slot, iter, flags);
+}
+EXPORT_SYMBOL_GPL(rust_helper_radix_tree_next_slot);
+
 /*
  * We use `bindgen`'s `--size_t-is-usize` option to bind the C `size_t` type
  * as the Rust `usize` type, so we can use it in contexts where Rust
diff --git a/rust/kernel/lib.rs b/rust/kernel/lib.rs
index 676995d4e460..a85cb6aae8d6 100644
--- a/rust/kernel/lib.rs
+++ b/rust/kernel/lib.rs
@@ -40,6 +40,7 @@ pub mod init;
 pub mod ioctl;
 pub mod prelude;
 pub mod print;
+pub mod radix_tree;
 mod static_assert;
 #[doc(hidden)]
 pub mod std_vendor;
diff --git a/rust/kernel/radix_tree.rs b/rust/kernel/radix_tree.rs
new file mode 100644
index 000000000000..f659ab8b017c
--- /dev/null
+++ b/rust/kernel/radix_tree.rs
@@ -0,0 +1,156 @@
+// SPDX-License-Identifier: GPL-2.0
+
+//! RadixTree abstraction.
+//!
+//! C header: [`include/linux/radix_tree.h`](../../include/linux/radix_tree.h)
+
+use crate::error::to_result;
+use crate::error::Result;
+use crate::types::ForeignOwnable;
+use crate::types::Opaque;
+use crate::types::ScopeGuard;
+use alloc::boxed::Box;
+use core::marker::PhantomData;
+use core::pin::Pin;
+
+type Key = u64;
+
+/// A map of `u64` to `ForeignOwnable`
+///
+/// # Invariants
+///
+/// - `tree` always points to a valid and initialized `struct radix_tree`.
+/// - Pointers stored in the tree are created by a call to `ForignOwnable::into_foreign()`
+pub struct RadixTree<V: ForeignOwnable> {
+    tree: Pin<Box<Opaque<bindings::xarray>>>,
+    _marker: PhantomData<V>,
+}
+
+impl<V: ForeignOwnable> RadixTree<V> {
+    /// Create a new radix tree
+    ///
+    /// Note: This function allocates memory with `GFP_ATOMIC`.
+    pub fn new() -> Result<Self> {
+        let tree = Pin::from(Box::try_new(Opaque::uninit())?);
+
+        // SAFETY: `tree` points to allocated but not initialized memory. This
+        // call will initialize the memory.
+        unsafe { bindings::init_radix_tree(tree.get(), bindings::GFP_ATOMIC) };
+
+        Ok(Self {
+            tree,
+            _marker: PhantomData,
+        })
+    }
+
+    /// Try to insert a value into the tree
+    pub fn try_insert(&mut self, key: Key, value: V) -> Result<()> {
+        // SAFETY: `self.tree` points to a valid and initialized `struct radix_tree`
+        let ret =
+            unsafe { bindings::radix_tree_insert(self.tree.get(), key, value.into_foreign() as _) };
+        to_result(ret)
+    }
+
+    /// Search for `key` in the map. Returns a reference to the associated
+    /// value if found.
+    pub fn get(&self, key: Key) -> Option<V::Borrowed<'_>> {
+        // SAFETY: `self.tree` points to a valid and initialized `struct radix_tree`
+        let item =
+            core::ptr::NonNull::new(unsafe { bindings::radix_tree_lookup(self.tree.get(), key) })?;
+
+        // SAFETY: `item` was created by a call to
+        // `ForeignOwnable::into_foreign()`. As `get_mut()` and `remove()` takes
+        // a `&mut self`, no mutable borrows for `item` can exist and
+        // `ForeignOwnable::from_foreign()` cannot be called until this borrow
+        // is dropped.
+        Some(unsafe { V::borrow(item.as_ptr()) })
+    }
+
+    /// Search for `key` in the map. Return a mutable reference to the
+    /// associated value if found.
+    pub fn get_mut(&mut self, key: Key) -> Option<MutBorrow<'_, V>> {
+        let item =
+            core::ptr::NonNull::new(unsafe { bindings::radix_tree_lookup(self.tree.get(), key) })?;
+
+        // SAFETY: `item` was created by a call to
+        // `ForeignOwnable::into_foreign()`. As `get()` takes a `&self` and
+        // `remove()` takes a `&mut self`, no borrows for `item` can exist and
+        // `ForeignOwnable::from_foreign()` cannot be called until this borrow
+        // is dropped.
+        Some(MutBorrow {
+            guard: unsafe { V::borrow_mut(item.as_ptr()) },
+            _marker: core::marker::PhantomData,
+        })
+    }
+
+    /// Search for `key` in the map. If `key` is found, the key and value is
+    /// removed from the map and the value is returned.
+    pub fn remove(&mut self, key: Key) -> Option<V> {
+        // SAFETY: `self.tree` points to a valid and initialized `struct radix_tree`
+        let item =
+            core::ptr::NonNull::new(unsafe { bindings::radix_tree_delete(self.tree.get(), key) })?;
+
+        // SAFETY: `item` was created by a call to
+        // `ForeignOwnable::into_foreign()` and no borrows to `item` can exist
+        // because this function takes a `&mut self`.
+        Some(unsafe { ForeignOwnable::from_foreign(item.as_ptr()) })
+    }
+}
+
+impl<V: ForeignOwnable> Drop for RadixTree<V> {
+    fn drop(&mut self) {
+        let mut iter = bindings::radix_tree_iter {
+            index: 0,
+            next_index: 0,
+            tags: 0,
+            node: core::ptr::null_mut(),
+        };
+
+        // SAFETY: Iter is valid as we allocated it on the stack above
+        let mut slot = unsafe { bindings::radix_tree_iter_init(&mut iter, 0) };
+        loop {
+            if slot.is_null() {
+                // SAFETY: Both `self.tree` and `iter` are valid
+                slot = unsafe { bindings::radix_tree_next_chunk(self.tree.get(), &mut iter, 0) };
+            }
+
+            if slot.is_null() {
+                break;
+            }
+
+            // SAFETY: `self.tree` is valid and iter is managed by
+            // `radix_tree_next_chunk()` and `radix_tree_next_slot()`
+            let item = unsafe { bindings::radix_tree_delete(self.tree.get(), iter.index) };
+            assert!(!item.is_null());
+
+            // SAFETY: All items in the tree are created by a call to
+            // `ForeignOwnable::into_foreign()`.
+            let _ = unsafe { V::from_foreign(item) };
+
+            // SAFETY: `self.tree` is valid and iter is managed by
+            // `radix_tree_next_chunk()` and `radix_tree_next_slot()`. Slot is
+            // not null.
+            slot = unsafe { bindings::radix_tree_next_slot(slot, &mut iter, 0) };
+        }
+    }
+}
+
+/// A mutable borrow of an object owned by a `RadixTree`
+pub struct MutBorrow<'a, V: ForeignOwnable> {
+    guard: ScopeGuard<V, fn(V)>,
+    _marker: core::marker::PhantomData<&'a mut V>,
+}
+
+impl<'a, V: ForeignOwnable> core::ops::Deref for MutBorrow<'a, V> {
+    type Target = ScopeGuard<V, fn(V)>;
+
+    fn deref(&self) -> &Self::Target {
+        &self.guard
+    }
+}
+
+impl<'a, V: ForeignOwnable> core::ops::DerefMut for MutBorrow<'a, V> {
+    fn deref_mut(&mut self) -> &mut Self::Target {
+        &mut self.guard
+    }
+}
-- 
2.40.0


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [RFC PATCH 02/11] rust: add `pages` module for handling page allocation
  2023-05-03  9:06 [RFC PATCH 00/11] Rust null block driver Andreas Hindborg
  2023-05-03  9:06 ` [RFC PATCH 01/11] rust: add radix tree abstraction Andreas Hindborg
@ 2023-05-03  9:06 ` Andreas Hindborg
  2023-05-03 12:31   ` Benno Lossin
  2023-05-05  4:09   ` Matthew Wilcox
  2023-05-03  9:07 ` [RFC PATCH 03/11] rust: block: introduce `kernel::block::mq` module Andreas Hindborg
                   ` (14 subsequent siblings)
  16 siblings, 2 replies; 69+ messages in thread
From: Andreas Hindborg @ 2023-05-03  9:06 UTC (permalink / raw)
  To: Jens Axboe, Christoph Hellwig, Keith Busch, Damien Le Moal,
	Hannes Reinecke, lsf-pc, rust-for-linux, linux-block
  Cc: Andreas Hindborg, Matthew Wilcox, Miguel Ojeda, Alex Gaynor,
	Wedson Almeida Filho, Boqun Feng, Gary Guo, Björn Roy Baron,
	Benno Lossin, Andreas Hindborg, open list, gost.dev

From: Andreas Hindborg <a.hindborg@samsung.com>

This patch adds support for working with pages of order 0. Support for pages
with higher order is deferred. Page allocation flags are fixed in this patch.
Future work might allow the user to specify allocation flags.

This patch is a heavily modified version of code available in the rust tree [1],
primarily adding support for multiple page mapping strategies.

[1] https://github.com/rust-for-Linux/linux/tree/bc22545f38d74473cfef3e9fd65432733435b79f/rust/kernel/pages.rs

Signed-off-by: Andreas Hindborg <a.hindborg@samsung.com>
---
 rust/helpers.c       |  31 +++++
 rust/kernel/lib.rs   |   6 +
 rust/kernel/pages.rs | 284 +++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 321 insertions(+)
 create mode 100644 rust/kernel/pages.rs

diff --git a/rust/helpers.c b/rust/helpers.c
index 5dd5e325b7cc..9bd9d95da951 100644
--- a/rust/helpers.c
+++ b/rust/helpers.c
@@ -27,6 +27,7 @@
 #include <linux/sched/signal.h>
 #include <linux/wait.h>
 #include <linux/radix-tree.h>
+#include <linux/highmem.h>
 
 __noreturn void rust_helper_BUG(void)
 {
@@ -150,6 +151,36 @@ void **rust_helper_radix_tree_next_slot(void **slot,
 }
 EXPORT_SYMBOL_GPL(rust_helper_radix_tree_next_slot);
 
+void *rust_helper_kmap(struct page *page)
+{
+	return kmap(page);
+}
+EXPORT_SYMBOL_GPL(rust_helper_kmap);
+
+void rust_helper_kunmap(struct page *page)
+{
+	return kunmap(page);
+}
+EXPORT_SYMBOL_GPL(rust_helper_kunmap);
+
+void *rust_helper_kmap_atomic(struct page *page)
+{
+	return kmap_atomic(page);
+}
+EXPORT_SYMBOL_GPL(rust_helper_kmap_atomic);
+
+void rust_helper_kunmap_atomic(void *address)
+{
+	kunmap_atomic(address);
+}
+EXPORT_SYMBOL_GPL(rust_helper_kunmap_atomic);
+
+struct page *rust_helper_alloc_pages(gfp_t gfp_mask, unsigned int order)
+{
+	return alloc_pages(gfp_mask, order);
+}
+EXPORT_SYMBOL_GPL(rust_helper_alloc_pages);
+
 /*
  * We use `bindgen`'s `--size_t-is-usize` option to bind the C `size_t` type
  * as the Rust `usize` type, so we can use it in contexts where Rust
diff --git a/rust/kernel/lib.rs b/rust/kernel/lib.rs
index a85cb6aae8d6..8bef6686504b 100644
--- a/rust/kernel/lib.rs
+++ b/rust/kernel/lib.rs
@@ -38,6 +38,7 @@ mod build_assert;
 pub mod error;
 pub mod init;
 pub mod ioctl;
+pub mod pages;
 pub mod prelude;
 pub mod print;
 pub mod radix_tree;
@@ -57,6 +58,11 @@ pub use uapi;
 #[doc(hidden)]
 pub use build_error::build_error;
 
+/// Page size defined in terms of the `PAGE_SHIFT` macro from C.
+///
+/// [`PAGE_SHIFT`]: ../../../include/asm-generic/page.h
+pub const PAGE_SIZE: u32 = 1 << bindings::PAGE_SHIFT;
+
 /// Prefix to appear before log messages printed from within the `kernel` crate.
 const __LOG_PREFIX: &[u8] = b"rust_kernel\0";
 
diff --git a/rust/kernel/pages.rs b/rust/kernel/pages.rs
new file mode 100644
index 000000000000..ed51b053dd5d
--- /dev/null
+++ b/rust/kernel/pages.rs
@@ -0,0 +1,284 @@
+// SPDX-License-Identifier: GPL-2.0
+
+//! Kernel page allocation and management.
+//!
+//! This module currently provides limited support. It supports pages of order 0
+//! for most operations. Page allocation flags are fixed.
+
+use crate::{bindings, error::code::*, error::Result, PAGE_SIZE};
+use core::{marker::PhantomData, ptr};
+
+/// A set of physical pages.
+///
+/// `Pages` holds a reference to a set of pages of order `ORDER`. Having the order as a generic
+/// const allows the struct to have the same size as a pointer.
+///
+/// # Invariants
+///
+/// The pointer `Pages::pages` is valid and points to 2^ORDER pages.
+pub struct Pages<const ORDER: u32> {
+    pub(crate) pages: *mut bindings::page,
+}
+
+impl<const ORDER: u32> Pages<ORDER> {
+    /// Allocates a new set of contiguous pages.
+    pub fn new() -> Result<Self> {
+        let pages = unsafe {
+            bindings::alloc_pages(
+                bindings::GFP_KERNEL | bindings::__GFP_ZERO | bindings::___GFP_HIGHMEM,
+                ORDER,
+            )
+        };
+        if pages.is_null() {
+            return Err(ENOMEM);
+        }
+        // INVARIANTS: We checked that the allocation above succeeded.
+        // SAFETY: We allocated pages above
+        Ok(unsafe { Self::from_raw(pages) })
+    }
+
+    /// Create a `Pages` from a raw `struct page` pointer
+    ///
+    /// # Safety
+    ///
+    /// Caller must own the pages pointed to by `ptr` as these will be freed
+    /// when the returned `Pages` is dropped.
+    pub unsafe fn from_raw(ptr: *mut bindings::page) -> Self {
+        Self { pages: ptr }
+    }
+}
+
+impl Pages<0> {
+    #[inline(always)]
+    fn check_offset_and_map<I: MappingInfo>(
+        &self,
+        offset: usize,
+        len: usize,
+    ) -> Result<PageMapping<'_, I>>
+    where
+        Pages<0>: MappingActions<I>,
+    {
+        let end = offset.checked_add(len).ok_or(EINVAL)?;
+        if end as u32 > PAGE_SIZE {
+            return Err(EINVAL);
+        }
+
+        let mapping = <Self as MappingActions<I>>::map(self);
+
+        Ok(mapping)
+    }
+
+    #[inline(always)]
+    unsafe fn read_internal<I: MappingInfo>(
+        &self,
+        dest: *mut u8,
+        offset: usize,
+        len: usize,
+    ) -> Result
+    where
+        Pages<0>: MappingActions<I>,
+    {
+        let mapping = self.check_offset_and_map::<I>(offset, len)?;
+
+        unsafe { ptr::copy_nonoverlapping((mapping.ptr as *mut u8).add(offset), dest, len) };
+        Ok(())
+    }
+
+    /// Maps the pages and reads from them into the given buffer.
+    ///
+    /// # Safety
+    ///
+    /// Callers must ensure that the destination buffer is valid for the given
+    /// length. Additionally, if the raw buffer is intended to be recast, they
+    /// must ensure that the data can be safely cast;
+    /// [`crate::io_buffer::ReadableFromBytes`] has more details about it.
+    /// `dest` may not point to the source page.
+    #[inline(always)]
+    pub unsafe fn read(&self, dest: *mut u8, offset: usize, len: usize) -> Result {
+        unsafe { self.read_internal::<NormalMappingInfo>(dest, offset, len) }
+    }
+
+    /// Maps the pages and reads from them into the given buffer. The page is
+    /// mapped atomically.
+    ///
+    /// # Safety
+    ///
+    /// Callers must ensure that the destination buffer is valid for the given
+    /// length. Additionally, if the raw buffer is intended to be recast, they
+    /// must ensure that the data can be safely cast;
+    /// [`crate::io_buffer::ReadableFromBytes`] has more details about it.
+    /// `dest` may not point to the source page.
+    #[inline(always)]
+    pub unsafe fn read_atomic(&self, dest: *mut u8, offset: usize, len: usize) -> Result {
+        unsafe { self.read_internal::<AtomicMappingInfo>(dest, offset, len) }
+    }
+
+    #[inline(always)]
+    unsafe fn write_internal<I: MappingInfo>(
+        &self,
+        src: *const u8,
+        offset: usize,
+        len: usize,
+    ) -> Result
+    where
+        Pages<0>: MappingActions<I>,
+    {
+        let mapping = self.check_offset_and_map::<I>(offset, len)?;
+
+        unsafe { ptr::copy_nonoverlapping(src, (mapping.ptr as *mut u8).add(offset), len) };
+        Ok(())
+    }
+
+    /// Maps the pages and writes into them from the given buffer.
+    ///
+    /// # Safety
+    ///
+    /// Callers must ensure that the buffer is valid for the given length.
+    /// Additionally, if the page is (or will be) mapped by userspace, they must
+    /// ensure that no kernel data is leaked through padding if it was cast from
+    /// another type; [`crate::io_buffer::WritableToBytes`] has more details
+    /// about it. `src` must not point to the destination page.
+    #[inline(always)]
+    pub unsafe fn write(&self, src: *const u8, offset: usize, len: usize) -> Result {
+        unsafe { self.write_internal::<NormalMappingInfo>(src, offset, len) }
+    }
+
+    /// Maps the pages and writes into them from the given buffer. The page is
+    /// mapped atomically.
+    ///
+    /// # Safety
+    ///
+    /// Callers must ensure that the buffer is valid for the given length.
+    /// Additionally, if the page is (or will be) mapped by userspace, they must
+    /// ensure that no kernel data is leaked through padding if it was cast from
+    /// another type; [`crate::io_buffer::WritableToBytes`] has more details
+    /// about it. `src` must not point to the destination page.
+    #[inline(always)]
+    pub unsafe fn write_atomic(&self, src: *const u8, offset: usize, len: usize) -> Result {
+        unsafe { self.write_internal::<AtomicMappingInfo>(src, offset, len) }
+    }
+
+    /// Maps the page at index 0.
+    #[inline(always)]
+    pub fn kmap(&self) -> PageMapping<'_, NormalMappingInfo> {
+        let ptr = unsafe { bindings::kmap(self.pages) };
+
+        PageMapping {
+            page: self.pages,
+            ptr,
+            _phantom: PhantomData,
+            _phantom2: PhantomData,
+        }
+    }
+
+    /// Atomically Maps the page at index 0.
+    #[inline(always)]
+    pub fn kmap_atomic(&self) -> PageMapping<'_, AtomicMappingInfo> {
+        let ptr = unsafe { bindings::kmap_atomic(self.pages) };
+
+        PageMapping {
+            page: self.pages,
+            ptr,
+            _phantom: PhantomData,
+            _phantom2: PhantomData,
+        }
+    }
+}
+
+impl<const ORDER: u32> Drop for Pages<ORDER> {
+    fn drop(&mut self) {
+        // SAFETY: By the type invariants, we know the pages are allocated with the given order.
+        unsafe { bindings::__free_pages(self.pages, ORDER) };
+    }
+}
+
+/// Specifies the type of page mapping
+pub trait MappingInfo {}
+
+/// Encapsulates methods to map and unmap pages
+pub trait MappingActions<I: MappingInfo>
+where
+    Pages<0>: MappingActions<I>,
+{
+    /// Map a page into the kernel address scpace
+    fn map(pages: &Pages<0>) -> PageMapping<'_, I>;
+
+    /// Unmap a page specified by `mapping`
+    ///
+    /// # Safety
+    ///
+    /// Must only be called by `PageMapping::drop()`.
+    unsafe fn unmap(mapping: &PageMapping<'_, I>);
+}
+
+/// A type state indicating that pages were mapped atomically
+pub struct AtomicMappingInfo;
+impl MappingInfo for AtomicMappingInfo {}
+
+/// A type state indicating that pages were not mapped atomically
+pub struct NormalMappingInfo;
+impl MappingInfo for NormalMappingInfo {}
+
+impl MappingActions<AtomicMappingInfo> for Pages<0> {
+    #[inline(always)]
+    fn map(pages: &Pages<0>) -> PageMapping<'_, AtomicMappingInfo> {
+        pages.kmap_atomic()
+    }
+
+    #[inline(always)]
+    unsafe fn unmap(mapping: &PageMapping<'_, AtomicMappingInfo>) {
+        // SAFETY: An instance of `PageMapping` is created only when `kmap` succeeded for the given
+        // page, so it is safe to unmap it here.
+        unsafe { bindings::kunmap_atomic(mapping.ptr) };
+    }
+}
+
+impl MappingActions<NormalMappingInfo> for Pages<0> {
+    #[inline(always)]
+    fn map(pages: &Pages<0>) -> PageMapping<'_, NormalMappingInfo> {
+        pages.kmap()
+    }
+
+    #[inline(always)]
+    unsafe fn unmap(mapping: &PageMapping<'_, NormalMappingInfo>) {
+        // SAFETY: An instance of `PageMapping` is created only when `kmap` succeeded for the given
+        // page, so it is safe to unmap it here.
+        unsafe { bindings::kunmap(mapping.page) };
+    }
+}
+
+/// An owned page mapping. When this struct is dropped, the page is unmapped.
+pub struct PageMapping<'a, I: MappingInfo>
+where
+    Pages<0>: MappingActions<I>,
+{
+    page: *mut bindings::page,
+    ptr: *mut core::ffi::c_void,
+    _phantom: PhantomData<&'a i32>,
+    _phantom2: PhantomData<I>,
+}
+
+impl<'a, I: MappingInfo> PageMapping<'a, I>
+where
+    Pages<0>: MappingActions<I>,
+{
+    /// Return a pointer to the wrapped `struct page`
+    #[inline(always)]
+    pub fn get_ptr(&self) -> *mut core::ffi::c_void {
+        self.ptr
+    }
+}
+
+// Because we do not have Drop specialization, we have to do this dance. Life
+// would be much more simple if we could have `impl Drop for PageMapping<'_,
+// Atomic>` and `impl Drop for PageMapping<'_, NotAtomic>`
+impl<I: MappingInfo> Drop for PageMapping<'_, I>
+where
+    Pages<0>: MappingActions<I>,
+{
+    #[inline(always)]
+    fn drop(&mut self) {
+        // SAFETY: We are OK to call this because we are `PageMapping::drop()`
+        unsafe { <Pages<0> as MappingActions<I>>::unmap(self) }
+    }
+}
-- 
2.40.0


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [RFC PATCH 03/11] rust: block: introduce `kernel::block::mq` module
  2023-05-03  9:06 [RFC PATCH 00/11] Rust null block driver Andreas Hindborg
  2023-05-03  9:06 ` [RFC PATCH 01/11] rust: add radix tree abstraction Andreas Hindborg
  2023-05-03  9:06 ` [RFC PATCH 02/11] rust: add `pages` module for handling page allocation Andreas Hindborg
@ 2023-05-03  9:07 ` Andreas Hindborg
  2023-05-08 12:29   ` Benno Lossin
  2023-05-03  9:07 ` [RFC PATCH 04/11] rust: block: introduce `kernel::block::bio` module Andreas Hindborg
                   ` (13 subsequent siblings)
  16 siblings, 1 reply; 69+ messages in thread
From: Andreas Hindborg @ 2023-05-03  9:07 UTC (permalink / raw)
  To: Jens Axboe, Christoph Hellwig, Keith Busch, Damien Le Moal,
	Hannes Reinecke, lsf-pc, rust-for-linux, linux-block
  Cc: Andreas Hindborg, Matthew Wilcox, Miguel Ojeda, Alex Gaynor,
	Wedson Almeida Filho, Boqun Feng, Gary Guo, Björn Roy Baron,
	Benno Lossin, Andreas Hindborg, open list, gost.dev

From: Andreas Hindborg <a.hindborg@samsung.com>

Add initial abstractions for working with blk-mq.

This patch is a maintained, refactored subset of code originally published by
Wedson Almeida Filho <wedsonaf@gmail.com> [1].

[1] https://github.com/wedsonaf/linux/tree/f2cfd2fe0e2ca4e90994f96afe268bbd4382a891/rust/kernel/blk/mq.rs

Cc: Wedson Almeida Filho <wedsonaf@gmail.com>
Signed-off-by: Andreas Hindborg <a.hindborg@samsung.com>
---
 rust/bindings/bindings_helper.h    |   2 +
 rust/helpers.c                     |  22 +++
 rust/kernel/block.rs               |   5 +
 rust/kernel/block/mq.rs            |  15 ++
 rust/kernel/block/mq/gen_disk.rs   | 133 +++++++++++++++
 rust/kernel/block/mq/operations.rs | 260 +++++++++++++++++++++++++++++
 rust/kernel/block/mq/raw_writer.rs |  30 ++++
 rust/kernel/block/mq/request.rs    |  71 ++++++++
 rust/kernel/block/mq/tag_set.rs    |  92 ++++++++++
 rust/kernel/error.rs               |   4 +
 rust/kernel/lib.rs                 |   1 +
 11 files changed, 635 insertions(+)
 create mode 100644 rust/kernel/block.rs
 create mode 100644 rust/kernel/block/mq.rs
 create mode 100644 rust/kernel/block/mq/gen_disk.rs
 create mode 100644 rust/kernel/block/mq/operations.rs
 create mode 100644 rust/kernel/block/mq/raw_writer.rs
 create mode 100644 rust/kernel/block/mq/request.rs
 create mode 100644 rust/kernel/block/mq/tag_set.rs

diff --git a/rust/bindings/bindings_helper.h b/rust/bindings/bindings_helper.h
index 52834962b94d..86c07eeb1ba1 100644
--- a/rust/bindings/bindings_helper.h
+++ b/rust/bindings/bindings_helper.h
@@ -11,6 +11,8 @@
 #include <linux/wait.h>
 #include <linux/sched.h>
 #include <linux/radix-tree.h>
+#include <linux/blk_types.h>
+#include <linux/blk-mq.h>
 
 /* `bindgen` gets confused at certain things. */
 const gfp_t BINDINGS_GFP_KERNEL = GFP_KERNEL;
diff --git a/rust/helpers.c b/rust/helpers.c
index 9bd9d95da951..a59341084774 100644
--- a/rust/helpers.c
+++ b/rust/helpers.c
@@ -18,6 +18,7 @@
  * accidentally exposed.
  */
 
+#include <linux/bio.h>
 #include <linux/bug.h>
 #include <linux/build_bug.h>
 #include <linux/err.h>
@@ -28,6 +29,8 @@
 #include <linux/wait.h>
 #include <linux/radix-tree.h>
 #include <linux/highmem.h>
+#include <linux/blk-mq.h>
+#include <linux/blkdev.h>
 
 __noreturn void rust_helper_BUG(void)
 {
@@ -130,6 +133,25 @@ void rust_helper_put_task_struct(struct task_struct *t)
 }
 EXPORT_SYMBOL_GPL(rust_helper_put_task_struct);
 
+struct bio_vec rust_helper_req_bvec(struct request *rq)
+{
+	return req_bvec(rq);
+}
+EXPORT_SYMBOL_GPL(rust_helper_req_bvec);
+
+void *rust_helper_blk_mq_rq_to_pdu(struct request *rq)
+{
+	return blk_mq_rq_to_pdu(rq);
+}
+EXPORT_SYMBOL_GPL(rust_helper_blk_mq_rq_to_pdu);
+
+void rust_helper_bio_advance_iter_single(const struct bio *bio,
+                                         struct bvec_iter *iter,
+                                         unsigned int bytes) {
+  bio_advance_iter_single(bio, iter, bytes);
+}
+EXPORT_SYMBOL_GPL(rust_helper_bio_advance_iter_single);
+
 void rust_helper_init_radix_tree(struct xarray *tree, gfp_t gfp_mask)
 {
 	INIT_RADIX_TREE(tree, gfp_mask);
diff --git a/rust/kernel/block.rs b/rust/kernel/block.rs
new file mode 100644
index 000000000000..4c93317a568a
--- /dev/null
+++ b/rust/kernel/block.rs
@@ -0,0 +1,5 @@
+// SPDX-License-Identifier: GPL-2.0
+
+//! Types for working with the block layer
+
+pub mod mq;
diff --git a/rust/kernel/block/mq.rs b/rust/kernel/block/mq.rs
new file mode 100644
index 000000000000..5b40f6a73c0f
--- /dev/null
+++ b/rust/kernel/block/mq.rs
@@ -0,0 +1,15 @@
+// SPDX-License-Identifier: GPL-2.0
+
+//! This module provides types for implementing drivers that interface the
+//! blk-mq subsystem
+
+mod gen_disk;
+mod operations;
+mod raw_writer;
+mod request;
+mod tag_set;
+
+pub use gen_disk::GenDisk;
+pub use operations::Operations;
+pub use request::Request;
+pub use tag_set::TagSet;
diff --git a/rust/kernel/block/mq/gen_disk.rs b/rust/kernel/block/mq/gen_disk.rs
new file mode 100644
index 000000000000..50496af15bbf
--- /dev/null
+++ b/rust/kernel/block/mq/gen_disk.rs
@@ -0,0 +1,133 @@
+// SPDX-License-Identifier: GPL-2.0
+
+//! GenDisk abstraction
+//!
+//! C header: [`include/linux/blkdev.h`](../../include/linux/blkdev.h)
+//! C header: [`include/linux/blk_mq.h`](../../include/linux/blk_mq.h)
+
+use crate::block::mq::{raw_writer::RawWriter, Operations, TagSet};
+use crate::{
+    bindings, error::from_err_ptr, error::Result, sync::Arc, types::ForeignOwnable,
+    types::ScopeGuard,
+};
+use core::fmt::{self, Write};
+
+/// A generic block device
+///
+/// # Invariants
+///
+///  - `gendisk` must always point to an initialized and valid `struct gendisk`.
+pub struct GenDisk<T: Operations> {
+    _tagset: Arc<TagSet<T>>,
+    gendisk: *mut bindings::gendisk,
+}
+
+// SAFETY: `GenDisk` is an owned pointer to a `struct gendisk` and an `Arc` to a
+// `TagSet` It is safe to send this to other threads as long as T is Send.
+unsafe impl<T: Operations + Send> Send for GenDisk<T> {}
+
+impl<T: Operations> GenDisk<T> {
+    /// Try to create a new `GenDisk`
+    pub fn try_new(tagset: Arc<TagSet<T>>, queue_data: T::QueueData) -> Result<Self> {
+        let data = queue_data.into_foreign();
+        let recover_data = ScopeGuard::new(|| {
+            // SAFETY: T::QueueData was created by the call to `into_foreign()` above
+            unsafe { T::QueueData::from_foreign(data) };
+        });
+
+        let lock_class_key = crate::sync::LockClassKey::new();
+
+        // SAFETY: `tagset.raw_tag_set()` points to a valid and initialized tag set
+        let gendisk = from_err_ptr(unsafe {
+            bindings::__blk_mq_alloc_disk(tagset.raw_tag_set(), data as _, lock_class_key.as_ptr())
+        })?;
+
+        const TABLE: bindings::block_device_operations = bindings::block_device_operations {
+            submit_bio: None,
+            open: None,
+            release: None,
+            ioctl: None,
+            compat_ioctl: None,
+            check_events: None,
+            unlock_native_capacity: None,
+            getgeo: None,
+            set_read_only: None,
+            swap_slot_free_notify: None,
+            report_zones: None,
+            devnode: None,
+            alternative_gpt_sector: None,
+            get_unique_id: None,
+            owner: core::ptr::null_mut(),
+            pr_ops: core::ptr::null_mut(),
+            free_disk: None,
+            poll_bio: None,
+        };
+
+        // SAFETY: gendisk is a valid pointer as we initialized it above
+        unsafe { (*gendisk).fops = &TABLE };
+
+        recover_data.dismiss();
+        Ok(Self {
+            _tagset: tagset,
+            gendisk,
+        })
+    }
+
+    /// Set the name of the device
+    pub fn set_name(&self, args: fmt::Arguments<'_>) -> Result {
+        let mut raw_writer = RawWriter::from_array(unsafe { &mut (*self.gendisk).disk_name });
+        raw_writer.write_fmt(args)?;
+        raw_writer.write_char('\0')?;
+        Ok(())
+    }
+
+    /// Register the device with the kernel. When this function return, the
+    /// device is accessible from VFS. The kernel may issue reads to the device
+    /// during registration to discover partition infomation.
+    pub fn add(&self) -> Result {
+        crate::error::to_result(unsafe {
+            bindings::device_add_disk(core::ptr::null_mut(), self.gendisk, core::ptr::null_mut())
+        })
+    }
+
+    /// Call to tell the block layer the capcacity of the device
+    pub fn set_capacity(&self, sectors: u64) {
+        unsafe { bindings::set_capacity(self.gendisk, sectors) };
+    }
+
+    /// Set the logical block size of the device
+    pub fn set_queue_logical_block_size(&self, size: u32) {
+        unsafe { bindings::blk_queue_logical_block_size((*self.gendisk).queue, size) };
+    }
+
+    /// Set the physical block size of the device
+    pub fn set_queue_physical_block_size(&self, size: u32) {
+        unsafe { bindings::blk_queue_physical_block_size((*self.gendisk).queue, size) };
+    }
+
+    /// Set the rotational media attribute for the device
+    pub fn set_rotational(&self, rotational: bool) {
+        if !rotational {
+            unsafe {
+                bindings::blk_queue_flag_set(bindings::QUEUE_FLAG_NONROT, (*self.gendisk).queue)
+            };
+        } else {
+            unsafe {
+                bindings::blk_queue_flag_clear(bindings::QUEUE_FLAG_NONROT, (*self.gendisk).queue)
+            };
+        }
+    }
+}
+
+impl<T: Operations> Drop for GenDisk<T> {
+    fn drop(&mut self) {
+        let queue_data = unsafe { (*(*self.gendisk).queue).queuedata };
+
+        unsafe { bindings::del_gendisk(self.gendisk) };
+
+        // SAFETY: `queue.queuedata` was created by `GenDisk::try_new()` with a
+        // call to `ForeignOwnable::into_pointer()` to create `queuedata`.
+        // `ForeignOwnable::from_foreign()` is only called here.
+        let _queue_data = unsafe { T::QueueData::from_foreign(queue_data) };
+    }
+}
diff --git a/rust/kernel/block/mq/operations.rs b/rust/kernel/block/mq/operations.rs
new file mode 100644
index 000000000000..fb1ab707d1f0
--- /dev/null
+++ b/rust/kernel/block/mq/operations.rs
@@ -0,0 +1,260 @@
+// SPDX-License-Identifier: GPL-2.0
+
+//! This module provides an interface for blk-mq drivers to implement.
+//!
+//! C header: [`include/linux/blk-mq.h`](../../include/linux/blk-mq.h)
+
+use crate::{
+    bindings,
+    block::mq::{tag_set::TagSetRef, Request},
+    error::{from_result, Result},
+    types::ForeignOwnable,
+};
+use core::{marker::PhantomData, pin::Pin};
+
+/// Implement this trait to interface blk-mq as block devices
+#[macros::vtable]
+pub trait Operations: Sized {
+    /// Data associated with a request. This data is located next to the request
+    /// structure.
+    type RequestData;
+
+    /// Data associated with the `struct request_queue` that is allocated for
+    /// the `GenDisk` associated with this `Operations` implementation.
+    type QueueData: ForeignOwnable;
+
+    /// Data associated with a dispatch queue. This is stored as a pointer in
+    /// `struct blk_mq_hw_ctx`.
+    type HwData: ForeignOwnable;
+
+    /// Data associated with a tag set. This is stored as a pointer in `struct
+    /// blk_mq_tag_set`.
+    type TagSetData: ForeignOwnable;
+
+    /// Called by the kernel to allocate a new `RequestData`. The structure will
+    /// eventually be pinned, so defer initialization to `init_request_data()`
+    fn new_request_data(
+        _tagset_data: <Self::TagSetData as ForeignOwnable>::Borrowed<'_>,
+    ) -> Result<Self::RequestData>;
+
+    /// Called by the kernel to initialize a previously allocated `RequestData`
+    fn init_request_data(
+        _tagset_data: <Self::TagSetData as ForeignOwnable>::Borrowed<'_>,
+        _data: Pin<&mut Self::RequestData>,
+    ) -> Result {
+        Ok(())
+    }
+
+    /// Called by the kernel to queue a request with the driver. If `is_last` is
+    /// `false`, the driver is allowed to defer commiting the request.
+    fn queue_rq(
+        hw_data: <Self::HwData as ForeignOwnable>::Borrowed<'_>,
+        queue_data: <Self::QueueData as ForeignOwnable>::Borrowed<'_>,
+        rq: &Request<Self>,
+        is_last: bool,
+    ) -> Result;
+
+    /// Called by the kernel to indicate that queued requests should be submitted
+    fn commit_rqs(
+        hw_data: <Self::HwData as ForeignOwnable>::Borrowed<'_>,
+        queue_data: <Self::QueueData as ForeignOwnable>::Borrowed<'_>,
+    );
+
+    /// Called by the kernel when the request is completed
+    fn complete(_rq: &Request<Self>);
+
+    /// Called by the kernel to allocate and initialize a driver specific hardware context data
+    fn init_hctx(
+        tagset_data: <Self::TagSetData as ForeignOwnable>::Borrowed<'_>,
+        hctx_idx: u32,
+    ) -> Result<Self::HwData>;
+
+    /// Called by the kernel to poll the device for completed requests. Only used for poll queues.
+    fn poll(_hw_data: <Self::HwData as ForeignOwnable>::Borrowed<'_>) -> i32 {
+        unreachable!()
+    }
+
+    /// Called by the kernel to map submission queues to CPU cores.
+    fn map_queues(_tag_set: &TagSetRef) {
+        unreachable!()
+    }
+
+    // There is no need for exit_request() because `drop` will be called.
+}
+
+pub(crate) struct OperationsVtable<T: Operations>(PhantomData<T>);
+
+impl<T: Operations> OperationsVtable<T> {
+    // # Safety
+    //
+    // The caller of this function must ensure that `hctx` and `bd` are valid
+    // and initialized. The pointees must outlive this function. Further
+    // `hctx->driver_data` must be a pointer created by a call to
+    // `Self::init_hctx_callback()` and the pointee must outlive this function.
+    // This function must not be called with a `hctx` for which
+    // `Self::exit_hctx_callback()` has been called.
+    unsafe extern "C" fn queue_rq_callback(
+        hctx: *mut bindings::blk_mq_hw_ctx,
+        bd: *const bindings::blk_mq_queue_data,
+    ) -> bindings::blk_status_t {
+        // SAFETY: `bd` is valid as required by the safety requirement for this function.
+        let rq = unsafe { (*bd).rq };
+
+        // SAFETY: The safety requirement for this function ensure that
+        // `(*hctx).driver_data` was returned by a call to
+        // `Self::init_hctx_callback()`. That function uses
+        // `PointerWrapper::into_pointer()` to create `driver_data`. Further,
+        // the returned value does not outlive this function and
+        // `from_foreign()` is not called until `Self::exit_hctx_callback()` is
+        // called. By the safety requirement of this function and contract with
+        // the `blk-mq` API, `queue_rq_callback()` will not be called after that
+        // point.
+        let hw_data = unsafe { T::HwData::borrow((*hctx).driver_data) };
+
+        // SAFETY: `hctx` is valid as required by this function.
+        let queue_data = unsafe { (*(*hctx).queue).queuedata };
+
+        // SAFETY: `queue.queuedata` was created by `GenDisk::try_new()` with a
+        // call to `ForeignOwnable::into_pointer()` to create `queuedata`.
+        // `ForeignOwnable::from_foreign()` is only called when the tagset is
+        // dropped, which happens after we are dropped.
+        let queue_data = unsafe { T::QueueData::borrow(queue_data) };
+
+        // SAFETY: `bd` is valid as required by the safety requirement for this function.
+        let ret = T::queue_rq(
+            hw_data,
+            queue_data,
+            &unsafe { Request::from_ptr(rq) },
+            unsafe { (*bd).last },
+        );
+        if let Err(e) = ret {
+            e.to_blk_status()
+        } else {
+            bindings::BLK_STS_OK as _
+        }
+    }
+
+    unsafe extern "C" fn commit_rqs_callback(hctx: *mut bindings::blk_mq_hw_ctx) {
+        let hw_data = unsafe { T::HwData::borrow((*hctx).driver_data) };
+
+        // SAFETY: `hctx` is valid as required by this function.
+        let queue_data = unsafe { (*(*hctx).queue).queuedata };
+
+        // SAFETY: `queue.queuedata` was created by `GenDisk::try_new()` with a
+        // call to `ForeignOwnable::into_pointer()` to create `queuedata`.
+        // `ForeignOwnable::from_foreign()` is only called when the tagset is
+        // dropped, which happens after we are dropped.
+        let queue_data = unsafe { T::QueueData::borrow(queue_data) };
+        T::commit_rqs(hw_data, queue_data)
+    }
+
+    unsafe extern "C" fn complete_callback(rq: *mut bindings::request) {
+        T::complete(&unsafe { Request::from_ptr(rq) });
+    }
+
+    unsafe extern "C" fn poll_callback(
+        hctx: *mut bindings::blk_mq_hw_ctx,
+        _iob: *mut bindings::io_comp_batch,
+    ) -> core::ffi::c_int {
+        let hw_data = unsafe { T::HwData::borrow((*hctx).driver_data) };
+        T::poll(hw_data)
+    }
+
+    unsafe extern "C" fn init_hctx_callback(
+        hctx: *mut bindings::blk_mq_hw_ctx,
+        tagset_data: *mut core::ffi::c_void,
+        hctx_idx: core::ffi::c_uint,
+    ) -> core::ffi::c_int {
+        from_result(|| {
+            let tagset_data = unsafe { T::TagSetData::borrow(tagset_data) };
+            let data = T::init_hctx(tagset_data, hctx_idx)?;
+            unsafe { (*hctx).driver_data = data.into_foreign() as _ };
+            Ok(0)
+        })
+    }
+
+    unsafe extern "C" fn exit_hctx_callback(
+        hctx: *mut bindings::blk_mq_hw_ctx,
+        _hctx_idx: core::ffi::c_uint,
+    ) {
+        let ptr = unsafe { (*hctx).driver_data };
+        unsafe { T::HwData::from_foreign(ptr) };
+    }
+
+    unsafe extern "C" fn init_request_callback(
+        set: *mut bindings::blk_mq_tag_set,
+        rq: *mut bindings::request,
+        _hctx_idx: core::ffi::c_uint,
+        _numa_node: core::ffi::c_uint,
+    ) -> core::ffi::c_int {
+        from_result(|| {
+            // SAFETY: The tagset invariants guarantee that all requests are allocated with extra memory
+            // for the request data.
+            let pdu = unsafe { bindings::blk_mq_rq_to_pdu(rq) } as *mut T::RequestData;
+            let tagset_data = unsafe { T::TagSetData::borrow((*set).driver_data) };
+
+            let v = T::new_request_data(tagset_data)?;
+
+            // SAFETY: `pdu` memory is valid, as it was allocated by the caller.
+            unsafe { pdu.write(v) };
+
+            let tagset_data = unsafe { T::TagSetData::borrow((*set).driver_data) };
+            // SAFETY: `pdu` memory is valid and properly initialised.
+            T::init_request_data(tagset_data, unsafe { Pin::new_unchecked(&mut *pdu) })?;
+
+            Ok(0)
+        })
+    }
+
+    unsafe extern "C" fn exit_request_callback(
+        _set: *mut bindings::blk_mq_tag_set,
+        rq: *mut bindings::request,
+        _hctx_idx: core::ffi::c_uint,
+    ) {
+        // SAFETY: The tagset invariants guarantee that all requests are allocated with extra memory
+        // for the request data.
+        let pdu = unsafe { bindings::blk_mq_rq_to_pdu(rq) } as *mut T::RequestData;
+
+        // SAFETY: `pdu` is valid for read and write and is properly initialised.
+        unsafe { core::ptr::drop_in_place(pdu) };
+    }
+
+    unsafe extern "C" fn map_queues_callback(tag_set_ptr: *mut bindings::blk_mq_tag_set) {
+        let tag_set = unsafe { TagSetRef::from_ptr(tag_set_ptr) };
+        T::map_queues(&tag_set);
+    }
+
+    const VTABLE: bindings::blk_mq_ops = bindings::blk_mq_ops {
+        queue_rq: Some(Self::queue_rq_callback),
+        queue_rqs: None,
+        commit_rqs: Some(Self::commit_rqs_callback),
+        get_budget: None,
+        put_budget: None,
+        set_rq_budget_token: None,
+        get_rq_budget_token: None,
+        timeout: None,
+        poll: if T::HAS_POLL {
+            Some(Self::poll_callback)
+        } else {
+            None
+        },
+        complete: Some(Self::complete_callback),
+        init_hctx: Some(Self::init_hctx_callback),
+        exit_hctx: Some(Self::exit_hctx_callback),
+        init_request: Some(Self::init_request_callback),
+        exit_request: Some(Self::exit_request_callback),
+        cleanup_rq: None,
+        busy: None,
+        map_queues: if T::HAS_MAP_QUEUES {
+            Some(Self::map_queues_callback)
+        } else {
+            None
+        },
+        #[cfg(CONFIG_BLK_DEBUG_FS)]
+        show_rq: None,
+    };
+
+    pub(crate) const unsafe fn build() -> &'static bindings::blk_mq_ops {
+        &Self::VTABLE
+    }
+}
diff --git a/rust/kernel/block/mq/raw_writer.rs b/rust/kernel/block/mq/raw_writer.rs
new file mode 100644
index 000000000000..25c16ee0b1f7
--- /dev/null
+++ b/rust/kernel/block/mq/raw_writer.rs
@@ -0,0 +1,30 @@
+use core::fmt::{self, Write};
+
+pub(crate) struct RawWriter {
+    ptr: *mut u8,
+    len: usize,
+}
+
+impl RawWriter {
+    unsafe fn new(ptr: *mut u8, len: usize) -> Self {
+        Self { ptr, len }
+    }
+
+    pub(crate) fn from_array<const N: usize>(a: &mut [core::ffi::c_char; N]) -> Self {
+        unsafe { Self::new(&mut a[0] as *mut _ as _, N) }
+    }
+}
+
+impl Write for RawWriter {
+    fn write_str(&mut self, s: &str) -> fmt::Result {
+        let bytes = s.as_bytes();
+        let len = bytes.len();
+        if len > self.len {
+            return Err(fmt::Error);
+        }
+        unsafe { core::ptr::copy_nonoverlapping(&bytes[0], self.ptr, len) };
+        self.ptr = unsafe { self.ptr.add(len) };
+        self.len -= len;
+        Ok(())
+    }
+}
diff --git a/rust/kernel/block/mq/request.rs b/rust/kernel/block/mq/request.rs
new file mode 100644
index 000000000000..e95ae3fd71ad
--- /dev/null
+++ b/rust/kernel/block/mq/request.rs
@@ -0,0 +1,71 @@
+// SPDX-License-Identifier: GPL-2.0
+
+//! This module provides a wrapper for the C `struct request` type.
+//!
+//! C header: [`include/linux/blk-mq.h`](../../include/linux/blk-mq.h)
+
+use crate::{
+    bindings,
+    block::mq::Operations,
+    error::{Error, Result},
+};
+use core::marker::PhantomData;
+
+/// A wrapper around a blk-mq `struct request`. This represents an IO request.
+pub struct Request<T: Operations> {
+    ptr: *mut bindings::request,
+    _p: PhantomData<T>,
+}
+
+impl<T: Operations> Request<T> {
+    pub(crate) unsafe fn from_ptr(ptr: *mut bindings::request) -> Self {
+        Self {
+            ptr,
+            _p: PhantomData,
+        }
+    }
+
+    /// Get the command identifier for the request
+    pub fn command(&self) -> u32 {
+        unsafe { (*self.ptr).cmd_flags & ((1 << bindings::REQ_OP_BITS) - 1) }
+    }
+
+    /// Call this to indicate to the kernel that the request has been issued by the driver
+    pub fn start(&self) {
+        unsafe { bindings::blk_mq_start_request(self.ptr) };
+    }
+
+    /// Call this to indicate to the kernel that the request has been completed without errors
+    // TODO: Consume rq so that we can't use it after ending it?
+    pub fn end_ok(&self) {
+        unsafe { bindings::blk_mq_end_request(self.ptr, bindings::BLK_STS_OK as _) };
+    }
+
+    /// Call this to indicate to the kernel that the request completed with an error
+    pub fn end_err(&self, err: Error) {
+        unsafe { bindings::blk_mq_end_request(self.ptr, err.to_blk_status()) };
+    }
+
+    /// Call this to indicate that the request completed with the status indicated by `status`
+    pub fn end(&self, status: Result) {
+        if let Err(e) = status {
+            self.end_err(e);
+        } else {
+            self.end_ok();
+        }
+    }
+
+    /// Call this to schedule defered completion of the request
+    // TODO: Consume rq so that we can't use it after completing it?
+    pub fn complete(&self) {
+        if !unsafe { bindings::blk_mq_complete_request_remote(self.ptr) } {
+            T::complete(&unsafe { Self::from_ptr(self.ptr) });
+        }
+    }
+
+    /// Get the target sector for the request
+    #[inline(always)]
+    pub fn sector(&self) -> usize {
+        unsafe { (*self.ptr).__sector as usize }
+    }
+}
diff --git a/rust/kernel/block/mq/tag_set.rs b/rust/kernel/block/mq/tag_set.rs
new file mode 100644
index 000000000000..d122db7f6d0e
--- /dev/null
+++ b/rust/kernel/block/mq/tag_set.rs
@@ -0,0 +1,92 @@
+// SPDX-License-Identifier: GPL-2.0
+
+//! This module provides the `TagSet` struct to wrap the C `struct blk_mq_tag_set`.
+//!
+//! C header: [`include/linux/blk-mq.h`](../../include/linux/blk-mq.h)
+
+use crate::{
+    bindings,
+    block::mq::{operations::OperationsVtable, Operations},
+    error::{Error, Result},
+    sync::Arc,
+    types::ForeignOwnable,
+};
+use core::{cell::UnsafeCell, convert::TryInto, marker::PhantomData};
+
+/// A wrapper for the C `struct blk_mq_tag_set`
+pub struct TagSet<T: Operations> {
+    inner: UnsafeCell<bindings::blk_mq_tag_set>,
+    _p: PhantomData<T>,
+}
+
+impl<T: Operations> TagSet<T> {
+    /// Try to create a new tag set
+    pub fn try_new(
+        nr_hw_queues: u32,
+        tagset_data: T::TagSetData,
+        num_tags: u32,
+        num_maps: u32,
+    ) -> Result<Arc<Self>> {
+        let tagset = Arc::try_new(Self {
+            inner: UnsafeCell::new(bindings::blk_mq_tag_set::default()),
+            _p: PhantomData,
+        })?;
+
+        // SAFETY: We just allocated `tagset`, we know this is the only reference to it.
+        let inner = unsafe { &mut *tagset.inner.get() };
+
+        inner.ops = unsafe { OperationsVtable::<T>::build() };
+        inner.nr_hw_queues = nr_hw_queues;
+        inner.timeout = 0; // 0 means default which is 30 * HZ in C
+        inner.numa_node = bindings::NUMA_NO_NODE;
+        inner.queue_depth = num_tags;
+        inner.cmd_size = core::mem::size_of::<T::RequestData>().try_into()?;
+        inner.flags = bindings::BLK_MQ_F_SHOULD_MERGE;
+        inner.driver_data = tagset_data.into_foreign() as _;
+        inner.nr_maps = num_maps;
+
+        // SAFETY: `inner` points to valid and initialised memory.
+        let ret = unsafe { bindings::blk_mq_alloc_tag_set(inner) };
+        if ret < 0 {
+            // SAFETY: We created `driver_data` above with `into_foreign`
+            unsafe { T::TagSetData::from_foreign(inner.driver_data) };
+            return Err(Error::from_errno(ret));
+        }
+
+        Ok(tagset)
+    }
+
+    /// Return the pointer to the wrapped `struct blk_mq_tag_set`
+    pub(crate) fn raw_tag_set(&self) -> *mut bindings::blk_mq_tag_set {
+        self.inner.get()
+    }
+}
+
+impl<T: Operations> Drop for TagSet<T> {
+    fn drop(&mut self) {
+        let tagset_data = unsafe { (*self.inner.get()).driver_data };
+
+        // SAFETY: `inner` is valid and has been properly initialised during construction.
+        unsafe { bindings::blk_mq_free_tag_set(self.inner.get()) };
+
+        // SAFETY: `tagset_data` was created by a call to
+        // `ForeignOwnable::into_foreign` in `TagSet::try_new()`
+        unsafe { T::TagSetData::from_foreign(tagset_data) };
+    }
+}
+
+/// A tag set reference. Used to control lifetime and prevent drop of TagSet references passed to
+/// `Operations::map_queues()`
+pub struct TagSetRef {
+    ptr: *mut bindings::blk_mq_tag_set,
+}
+
+impl TagSetRef {
+    pub(crate) unsafe fn from_ptr(tagset: *mut bindings::blk_mq_tag_set) -> Self {
+        Self { ptr: tagset }
+    }
+
+    pub fn ptr(&self) -> *mut bindings::blk_mq_tag_set {
+        self.ptr
+    }
+}
diff --git a/rust/kernel/error.rs b/rust/kernel/error.rs
index 5f4114b30b94..421fef677321 100644
--- a/rust/kernel/error.rs
+++ b/rust/kernel/error.rs
@@ -107,6 +107,10 @@ impl Error {
         self.0
     }
 
+    pub(crate) fn to_blk_status(self) -> bindings::blk_status_t {
+        unsafe { bindings::errno_to_blk_status(self.0) }
+    }
+
     /// Returns the error encoded as a pointer.
     #[allow(dead_code)]
     pub(crate) fn to_ptr<T>(self) -> *mut T {
diff --git a/rust/kernel/lib.rs b/rust/kernel/lib.rs
index 8bef6686504b..cd798d12d97c 100644
--- a/rust/kernel/lib.rs
+++ b/rust/kernel/lib.rs
@@ -34,6 +34,7 @@ extern crate self as kernel;
 #[cfg(not(test))]
 #[cfg(not(testlib))]
 mod allocator;
+pub mod block;
 mod build_assert;
 pub mod error;
 pub mod init;
-- 
2.40.0


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [RFC PATCH 04/11] rust: block: introduce `kernel::block::bio` module
  2023-05-03  9:06 [RFC PATCH 00/11] Rust null block driver Andreas Hindborg
                   ` (2 preceding siblings ...)
  2023-05-03  9:07 ` [RFC PATCH 03/11] rust: block: introduce `kernel::block::mq` module Andreas Hindborg
@ 2023-05-03  9:07 ` Andreas Hindborg
  2023-05-08 12:58   ` Benno Lossin
  2023-05-03  9:07 ` [RFC PATCH 05/11] RUST: add `module_params` macro Andreas Hindborg
                   ` (12 subsequent siblings)
  16 siblings, 1 reply; 69+ messages in thread
From: Andreas Hindborg @ 2023-05-03  9:07 UTC (permalink / raw)
  To: Jens Axboe, Christoph Hellwig, Keith Busch, Damien Le Moal,
	Hannes Reinecke, lsf-pc, rust-for-linux, linux-block
  Cc: Andreas Hindborg, Matthew Wilcox, Miguel Ojeda, Alex Gaynor,
	Wedson Almeida Filho, Boqun Feng, Gary Guo, Björn Roy Baron,
	Benno Lossin, Andreas Hindborg, open list, gost.dev

From: Andreas Hindborg <a.hindborg@samsung.com>

Add abstractions for working with `struct bio`.

Signed-off-by: Andreas Hindborg <a.hindborg@samsung.com>
---
 rust/kernel/block.rs            |   1 +
 rust/kernel/block/bio.rs        |  93 ++++++++++++++++
 rust/kernel/block/bio/vec.rs    | 181 ++++++++++++++++++++++++++++++++
 rust/kernel/block/mq/request.rs |  16 +++
 4 files changed, 291 insertions(+)
 create mode 100644 rust/kernel/block/bio.rs
 create mode 100644 rust/kernel/block/bio/vec.rs

diff --git a/rust/kernel/block.rs b/rust/kernel/block.rs
index 4c93317a568a..1797859551fd 100644
--- a/rust/kernel/block.rs
+++ b/rust/kernel/block.rs
@@ -2,4 +2,5 @@
 
 //! Types for working with the block layer
 
+pub mod bio;
 pub mod mq;
diff --git a/rust/kernel/block/bio.rs b/rust/kernel/block/bio.rs
new file mode 100644
index 000000000000..6e93e4420105
--- /dev/null
+++ b/rust/kernel/block/bio.rs
@@ -0,0 +1,93 @@
+// SPDX-License-Identifier: GPL-2.0
+
+//! Types for working with the bio layer.
+//!
+//! C header: [`include/linux/blk_types.h`](../../include/linux/blk_types.h)
+
+use core::fmt;
+use core::ptr::NonNull;
+
+mod vec;
+
+pub use vec::BioSegmentIterator;
+pub use vec::Segment;
+
+/// A wrapper around a `struct bio` pointer
+///
+/// # Invariants
+///
+/// First field must alwyas be a valid pointer to a valid `struct bio`.
+pub struct Bio<'a>(
+    NonNull<crate::bindings::bio>,
+    core::marker::PhantomData<&'a ()>,
+);
+
+impl<'a> Bio<'a> {
+    /// Returns an iterator over segments in this `Bio`. Does not consider
+    /// segments of other bios in this bio chain.
+    #[inline(always)]
+    pub fn segment_iter(&'a self) -> BioSegmentIterator<'a> {
+        BioSegmentIterator::new(self)
+    }
+
+    /// Get a pointer to the `bio_vec` off this bio
+    #[inline(always)]
+    fn io_vec(&self) -> *const bindings::bio_vec {
+        // SAFETY: By type invariant, get_raw() returns a valid pointer to a
+        // valid `struct bio`
+        unsafe { (*self.get_raw()).bi_io_vec }
+    }
+
+    /// Return a copy of the `bvec_iter` for this `Bio`
+    #[inline(always)]
+    fn iter(&self) -> bindings::bvec_iter {
+        // SAFETY: self.0 is always a valid pointer
+        unsafe { (*self.get_raw()).bi_iter }
+    }
+
+    /// Get the next `Bio` in the chain
+    #[inline(always)]
+    fn next(&self) -> Option<Bio<'a>> {
+        // SAFETY: self.0 is always a valid pointer
+        let next = unsafe { (*self.get_raw()).bi_next };
+        Some(Self(NonNull::new(next)?, core::marker::PhantomData))
+    }
+
+    /// Return the raw pointer of the wrapped `struct bio`
+    #[inline(always)]
+    fn get_raw(&self) -> *const bindings::bio {
+        self.0.as_ptr()
+    }
+
+    /// Create an instance of `Bio` from a raw pointer. Does check that the
+    /// pointer is not null.
+    #[inline(always)]
+    pub(crate) unsafe fn from_raw(ptr: *mut bindings::bio) -> Option<Bio<'a>> {
+        Some(Self(NonNull::new(ptr)?, core::marker::PhantomData))
+    }
+}
+
+impl core::fmt::Display for Bio<'_> {
+    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
+        write!(f, "Bio {:?}", self.0.as_ptr())
+    }
+}
+
+/// An iterator over `Bio`
+pub struct BioIterator<'a> {
+    pub(crate) bio: Option<Bio<'a>>,
+}
+
+impl<'a> core::iter::Iterator for BioIterator<'a> {
+    type Item = Bio<'a>;
+
+    #[inline(always)]
+    fn next(&mut self) -> Option<Bio<'a>> {
+        if let Some(current) = self.bio.take() {
+            self.bio = current.next();
+            Some(current)
+        } else {
+            None
+        }
+    }
+}
diff --git a/rust/kernel/block/bio/vec.rs b/rust/kernel/block/bio/vec.rs
new file mode 100644
index 000000000000..acd328a6fe54
--- /dev/null
+++ b/rust/kernel/block/bio/vec.rs
@@ -0,0 +1,181 @@
+// SPDX-License-Identifier: GPL-2.0
+
+//! Types for working with `struct bio_vec` IO vectors
+//!
+//! C header: [`include/linux/bvec.h`](../../include/linux/bvec.h)
+
+use super::Bio;
+use crate::error::Result;
+use crate::pages::Pages;
+use core::fmt;
+use core::mem::ManuallyDrop;
+
+#[inline(always)]
+fn mp_bvec_iter_offset(bvec: *const bindings::bio_vec, iter: &bindings::bvec_iter) -> u32 {
+    (unsafe { (*bvec_iter_bvec(bvec, iter)).bv_offset }) + iter.bi_bvec_done
+}
+
+#[inline(always)]
+fn mp_bvec_iter_page(
+    bvec: *const bindings::bio_vec,
+    iter: &bindings::bvec_iter,
+) -> *mut bindings::page {
+    unsafe { (*bvec_iter_bvec(bvec, iter)).bv_page }
+}
+
+#[inline(always)]
+fn mp_bvec_iter_page_idx(bvec: *const bindings::bio_vec, iter: &bindings::bvec_iter) -> usize {
+    (mp_bvec_iter_offset(bvec, iter) / crate::PAGE_SIZE) as usize
+}
+
+#[inline(always)]
+fn mp_bvec_iter_len(bvec: *const bindings::bio_vec, iter: &bindings::bvec_iter) -> u32 {
+    iter.bi_size
+        .min(unsafe { (*bvec_iter_bvec(bvec, iter)).bv_len } - iter.bi_bvec_done)
+}
+
+#[inline(always)]
+fn bvec_iter_bvec(
+    bvec: *const bindings::bio_vec,
+    iter: &bindings::bvec_iter,
+) -> *const bindings::bio_vec {
+    unsafe { bvec.add(iter.bi_idx as usize) }
+}
+
+#[inline(always)]
+fn bvec_iter_page(
+    bvec: *const bindings::bio_vec,
+    iter: &bindings::bvec_iter,
+) -> *mut bindings::page {
+    unsafe { mp_bvec_iter_page(bvec, iter).add(mp_bvec_iter_page_idx(bvec, iter)) }
+}
+
+#[inline(always)]
+fn bvec_iter_len(bvec: *const bindings::bio_vec, iter: &bindings::bvec_iter) -> u32 {
+    mp_bvec_iter_len(bvec, iter).min(crate::PAGE_SIZE - bvec_iter_offset(bvec, iter))
+}
+
+#[inline(always)]
+fn bvec_iter_offset(bvec: *const bindings::bio_vec, iter: &bindings::bvec_iter) -> u32 {
+    mp_bvec_iter_offset(bvec, iter) % crate::PAGE_SIZE
+}
+
+/// A wrapper around a `strutct bio_vec` - a contiguous range of physical memory addresses
+///
+/// # Invariants
+///
+/// `bio_vec` must always be initialized and valid
+pub struct Segment<'a> {
+    bio_vec: bindings::bio_vec,
+    _marker: core::marker::PhantomData<&'a ()>,
+}
+
+impl Segment<'_> {
+    /// Get he lenght of the segment in bytes
+    #[inline(always)]
+    pub fn len(&self) -> usize {
+        self.bio_vec.bv_len as usize
+    }
+
+    /// Returns true if the length of the segment is 0
+    #[inline(always)]
+    pub fn is_empty(&self) -> bool {
+        self.len() == 0
+    }
+
+    /// Get the offset field of the `bio_vec`
+    #[inline(always)]
+    pub fn offset(&self) -> usize {
+        self.bio_vec.bv_offset as usize
+    }
+
+    /// Copy data of this segment into `page`.
+    #[inline(always)]
+    pub fn copy_to_page_atomic(&self, page: &mut Pages<0>) -> Result {
+        // SAFETY: self.bio_vec is valid and thus bv_page must be a valid
+        // pointer to a `struct page`. We do not own the page, but we prevent
+        // drop by wrapping the `Pages` in `ManuallyDrop`.
+        let our_page = ManuallyDrop::new(unsafe { Pages::<0>::from_raw(self.bio_vec.bv_page) });
+        let our_map = our_page.kmap_atomic();
+
+        // TODO: Checck offset is within page - what guarantees does `bio_vec` provide?
+        let ptr = unsafe { (our_map.get_ptr() as *const u8).add(self.offset()) };
+
+        unsafe { page.write_atomic(ptr, self.offset(), self.len()) }
+    }
+
+    /// Copy data from `page` into this segment
+    #[inline(always)]
+    pub fn copy_from_page_atomic(&mut self, page: &Pages<0>) -> Result {
+        // SAFETY: self.bio_vec is valid and thus bv_page must be a valid
+        // pointer to a `struct page`. We do not own the page, but we prevent
+        // drop by wrapping the `Pages` in `ManuallyDrop`.
+        let our_page = ManuallyDrop::new(unsafe { Pages::<0>::from_raw(self.bio_vec.bv_page) });
+        let our_map = our_page.kmap_atomic();
+
+        // TODO: Checck offset is within page
+        let ptr = unsafe { (our_map.get_ptr() as *mut u8).add(self.offset()) };
+
+        unsafe { page.read_atomic(ptr, self.offset(), self.len()) }
+    }
+}
+
+impl core::fmt::Display for Segment<'_> {
+    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
+        write!(
+            f,
+            "Segment {:?} len: {}",
+            self.bio_vec.bv_page, self.bio_vec.bv_len
+        )
+    }
+}
+
+/// An iterator over `Segment`
+pub struct BioSegmentIterator<'a> {
+    bio: &'a Bio<'a>,
+    iter: bindings::bvec_iter,
+}
+
+impl<'a> BioSegmentIterator<'a> {
+    #[inline(always)]
+    pub(crate) fn new(bio: &'a Bio<'a>) -> BioSegmentIterator<'_> {
+        Self {
+            bio,
+            iter: bio.iter(),
+        }
+    }
+}
+
+impl<'a> core::iter::Iterator for BioSegmentIterator<'a> {
+    type Item = Segment<'a>;
+
+    #[inline(always)]
+    fn next(&mut self) -> Option<Self::Item> {
+        if self.iter.bi_size == 0 {
+            return None;
+        }
+
+        // Macro
+        // bio_vec = bio_iter_iovec(bio, self.iter)
+        // bio_vec = bvec_iter_bvec(bio.bi_io_vec, self.iter);
+        let bio_vec_ret = bindings::bio_vec {
+            bv_page: bvec_iter_page(self.bio.io_vec(), &self.iter),
+            bv_len: bvec_iter_len(self.bio.io_vec(), &self.iter),
+            bv_offset: bvec_iter_offset(self.bio.io_vec(), &self.iter),
+        };
+
+        // Static C function
+        unsafe {
+            bindings::bio_advance_iter_single(
+                self.bio.get_raw(),
+                &mut self.iter as *mut bindings::bvec_iter,
+                bio_vec_ret.bv_len,
+            )
+        };
+
+        Some(Segment {
+            bio_vec: bio_vec_ret,
+            _marker: core::marker::PhantomData,
+        })
+    }
+}
diff --git a/rust/kernel/block/mq/request.rs b/rust/kernel/block/mq/request.rs
index e95ae3fd71ad..ccb1033b64b6 100644
--- a/rust/kernel/block/mq/request.rs
+++ b/rust/kernel/block/mq/request.rs
@@ -11,6 +11,9 @@ use crate::{
 };
 use core::marker::PhantomData;
 
+use crate::block::bio::Bio;
+use crate::block::bio::BioIterator;
+
 /// A wrapper around a blk-mq `struct request`. This represents an IO request.
 pub struct Request<T: Operations> {
     ptr: *mut bindings::request,
@@ -63,6 +66,19 @@ impl<T: Operations> Request<T> {
         }
     }
 
+    /// Get a wrapper for the first Bio in this request
+    #[inline(always)]
+    pub fn bio(&self) -> Option<Bio<'_>> {
+        let ptr = unsafe { (*self.ptr).bio };
+        unsafe { Bio::from_raw(ptr) }
+    }
+
+    /// Get an iterator over all bio structurs in this request
+    #[inline(always)]
+    pub fn bio_iter(&self) -> BioIterator<'_> {
+        BioIterator { bio: self.bio() }
+    }
+
     /// Get the target sector for the request
     #[inline(always)]
     pub fn sector(&self) -> usize {
-- 
2.40.0


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [RFC PATCH 05/11] RUST: add `module_params` macro
  2023-05-03  9:06 [RFC PATCH 00/11] Rust null block driver Andreas Hindborg
                   ` (3 preceding siblings ...)
  2023-05-03  9:07 ` [RFC PATCH 04/11] rust: block: introduce `kernel::block::bio` module Andreas Hindborg
@ 2023-05-03  9:07 ` Andreas Hindborg
  2023-05-03  9:07 ` [RFC PATCH 06/11] rust: apply cache line padding for `SpinLock` Andreas Hindborg
                   ` (11 subsequent siblings)
  16 siblings, 0 replies; 69+ messages in thread
From: Andreas Hindborg @ 2023-05-03  9:07 UTC (permalink / raw)
  To: Jens Axboe, Christoph Hellwig, Keith Busch, Damien Le Moal,
	Hannes Reinecke, lsf-pc, rust-for-linux, linux-block
  Cc: Andreas Hindborg, Matthew Wilcox, Miguel Ojeda, Alex Gaynor,
	Wedson Almeida Filho, Boqun Feng, Gary Guo, Björn Roy Baron,
	Benno Lossin, Andreas Hindborg, open list, gost.dev

From: Andreas Hindborg <a.hindborg@samsung.com>

This patch includes changes required for Rust kernel modules to utilize module
parameters.

Imported with minor changes from `rust` branch [1], see original code for
attributions.

[1] https://github.com/Rust-for-Linux/linux/tree/bc22545f38d74473cfef3e9fd65432733435b79f

Cc: adamrk <ark.email@gmail.com>
---
 rust/kernel/lib.rs          |   3 +
 rust/kernel/module_param.rs | 501 ++++++++++++++++++++++++++++++++++++
 rust/macros/helpers.rs      |  46 +++-
 rust/macros/module.rs       | 402 +++++++++++++++++++++++++++--
 4 files changed, 920 insertions(+), 32 deletions(-)
 create mode 100644 rust/kernel/module_param.rs

diff --git a/rust/kernel/lib.rs b/rust/kernel/lib.rs
index cd798d12d97c..a0bd0b0e2aef 100644
--- a/rust/kernel/lib.rs
+++ b/rust/kernel/lib.rs
@@ -22,6 +22,7 @@
 #![feature(pin_macro)]
 #![feature(receiver_trait)]
 #![feature(unsize)]
+#![feature(const_mut_refs)]
 
 // Ensure conditional compilation based on the kernel configuration works;
 // otherwise we may silently break things like initcall handling.
@@ -39,6 +40,8 @@ mod build_assert;
 pub mod error;
 pub mod init;
 pub mod ioctl;
+#[doc(hidden)]
+pub mod module_param;
 pub mod pages;
 pub mod prelude;
 pub mod print;
diff --git a/rust/kernel/module_param.rs b/rust/kernel/module_param.rs
new file mode 100644
index 000000000000..539decbdcd48
--- /dev/null
+++ b/rust/kernel/module_param.rs
@@ -0,0 +1,501 @@
+// SPDX-License-Identifier: GPL-2.0
+
+//! Types for module parameters.
+//!
+//! C header: [`include/linux/moduleparam.h`](../../../include/linux/moduleparam.h)
+
+use crate::error::{code::*, from_result};
+use crate::str::{CStr, Formatter};
+use core::fmt::Write;
+
+/// Types that can be used for module parameters.
+///
+/// Note that displaying the type in `sysfs` will fail if
+/// [`alloc::string::ToString::to_string`] (as implemented through the
+/// [`core::fmt::Display`] trait) writes more than [`PAGE_SIZE`]
+/// bytes (including an additional null terminator).
+///
+/// [`PAGE_SIZE`]: `crate::PAGE_SIZE`
+pub trait ModuleParam: core::fmt::Display + core::marker::Sized {
+    /// The `ModuleParam` will be used by the kernel module through this type.
+    ///
+    /// This may differ from `Self` if, for example, `Self` needs to track
+    /// ownership without exposing it or allocate extra space for other possible
+    /// parameter values. See [`StringParam`] or [`ArrayParam`] for examples.
+    type Value: ?Sized;
+
+    /// Whether the parameter is allowed to be set without an argument.
+    ///
+    /// Setting this to `true` allows the parameter to be passed without an
+    /// argument (e.g. just `module.param` instead of `module.param=foo`).
+    const NOARG_ALLOWED: bool;
+
+    /// Convert a parameter argument into the parameter value.
+    ///
+    /// `None` should be returned when parsing of the argument fails.
+    /// `arg == None` indicates that the parameter was passed without an
+    /// argument. If `NOARG_ALLOWED` is set to `false` then `arg` is guaranteed
+    /// to always be `Some(_)`.
+    ///
+    /// Parameters passed at boot time will be set before [`kmalloc`] is
+    /// available (even if the module is loaded at a later time). However, in
+    /// this case, the argument buffer will be valid for the entire lifetime of
+    /// the kernel. So implementations of this method which need to allocate
+    /// should first check that the allocator is available (with
+    /// [`crate::bindings::slab_is_available`]) and when it is not available
+    /// provide an alternative implementation which doesn't allocate. In cases
+    /// where the allocator is not available it is safe to save references to
+    /// `arg` in `Self`, but in other cases a copy should be made.
+    ///
+    /// [`kmalloc`]: ../../../include/linux/slab.h
+    fn try_from_param_arg(arg: Option<&'static [u8]>) -> Option<Self>;
+
+    /// Get the current value of the parameter for use in the kernel module.
+    ///
+    /// This function should not be used directly. Instead use the wrapper
+    /// `read` which will be generated by [`macros::module`].
+    fn value(&self) -> &Self::Value;
+
+    /// Set the module parameter from a string.
+    ///
+    /// Used to set the parameter value when loading the module or when set
+    /// through `sysfs`.
+    ///
+    /// # Safety
+    ///
+    /// If `val` is non-null then it must point to a valid null-terminated
+    /// string. The `arg` field of `param` must be an instance of `Self`.
+    unsafe extern "C" fn set_param(
+        val: *const core::ffi::c_char,
+        param: *const crate::bindings::kernel_param,
+    ) -> core::ffi::c_int {
+        let arg = if val.is_null() {
+            None
+        } else {
+            Some(unsafe { CStr::from_char_ptr(val).as_bytes() })
+        };
+        match Self::try_from_param_arg(arg) {
+            Some(new_value) => {
+                let old_value = unsafe { (*param).__bindgen_anon_1.arg as *mut Self };
+                let _ = unsafe { core::ptr::replace(old_value, new_value) };
+                0
+            }
+            None => EINVAL.to_errno(),
+        }
+    }
+
+    /// Write a string representation of the current parameter value to `buf`.
+    ///
+    /// Used for displaying the current parameter value in `sysfs`.
+    ///
+    /// # Safety
+    ///
+    /// `buf` must be a buffer of length at least `kernel::PAGE_SIZE` that is
+    /// writeable. The `arg` field of `param` must be an instance of `Self`.
+    unsafe extern "C" fn get_param(
+        buf: *mut core::ffi::c_char,
+        param: *const crate::bindings::kernel_param,
+    ) -> core::ffi::c_int {
+        from_result(|| {
+            // SAFETY: The C contracts guarantees that the buffer is at least `PAGE_SIZE` bytes.
+            let mut f = unsafe { Formatter::from_buffer(buf.cast(), crate::PAGE_SIZE as usize) };
+            unsafe { write!(f, "{}\0", *((*param).__bindgen_anon_1.arg as *mut Self)) }?;
+            Ok(f.bytes_written().try_into()?)
+        })
+    }
+
+    /// Drop the parameter.
+    ///
+    /// Called when unloading a module.
+    ///
+    /// # Safety
+    ///
+    /// The `arg` field of `param` must be an instance of `Self`.
+    unsafe extern "C" fn free(arg: *mut core::ffi::c_void) {
+        unsafe { core::ptr::drop_in_place(arg as *mut Self) };
+    }
+}
+
+/// Trait for parsing integers.
+///
+/// Strings beginning with `0x`, `0o`, or `0b` are parsed as hex, octal, or
+/// binary respectively. Strings beginning with `0` otherwise are parsed as
+/// octal. Anything else is parsed as decimal. A leading `+` or `-` is also
+/// permitted. Any string parsed by [`kstrtol()`] or [`kstrtoul()`] will be
+/// successfully parsed.
+///
+/// [`kstrtol()`]: https://www.kernel.org/doc/html/latest/core-api/kernel-api.html#c.kstrtol
+/// [`kstrtoul()`]: https://www.kernel.org/doc/html/latest/core-api/kernel-api.html#c.kstrtoul
+trait ParseInt: Sized {
+    fn from_str_radix(src: &str, radix: u32) -> Result<Self, core::num::ParseIntError>;
+    fn checked_neg(self) -> Option<Self>;
+
+    fn from_str_unsigned(src: &str) -> Result<Self, core::num::ParseIntError> {
+        let (radix, digits) = if let Some(n) = src.strip_prefix("0x") {
+            (16, n)
+        } else if let Some(n) = src.strip_prefix("0X") {
+            (16, n)
+        } else if let Some(n) = src.strip_prefix("0o") {
+            (8, n)
+        } else if let Some(n) = src.strip_prefix("0O") {
+            (8, n)
+        } else if let Some(n) = src.strip_prefix("0b") {
+            (2, n)
+        } else if let Some(n) = src.strip_prefix("0B") {
+            (2, n)
+        } else if src.starts_with('0') {
+            (8, src)
+        } else {
+            (10, src)
+        };
+        Self::from_str_radix(digits, radix)
+    }
+
+    fn from_str(src: &str) -> Option<Self> {
+        match src.bytes().next() {
+            None => None,
+            Some(b'-') => Self::from_str_unsigned(&src[1..]).ok()?.checked_neg(),
+            Some(b'+') => Some(Self::from_str_unsigned(&src[1..]).ok()?),
+            Some(_) => Some(Self::from_str_unsigned(src).ok()?),
+        }
+    }
+}
+
+macro_rules! impl_parse_int {
+    ($ty:ident) => {
+        impl ParseInt for $ty {
+            fn from_str_radix(src: &str, radix: u32) -> Result<Self, core::num::ParseIntError> {
+                $ty::from_str_radix(src, radix)
+            }
+
+            fn checked_neg(self) -> Option<Self> {
+                self.checked_neg()
+            }
+        }
+    };
+}
+
+impl_parse_int!(i8);
+impl_parse_int!(u8);
+impl_parse_int!(i16);
+impl_parse_int!(u16);
+impl_parse_int!(i32);
+impl_parse_int!(u32);
+impl_parse_int!(i64);
+impl_parse_int!(u64);
+impl_parse_int!(isize);
+impl_parse_int!(usize);
+
+macro_rules! impl_module_param {
+    ($ty:ident) => {
+        impl ModuleParam for $ty {
+            type Value = $ty;
+
+            const NOARG_ALLOWED: bool = false;
+
+            fn try_from_param_arg(arg: Option<&'static [u8]>) -> Option<Self> {
+                let bytes = arg?;
+                let utf8 = core::str::from_utf8(bytes).ok()?;
+                <$ty as crate::module_param::ParseInt>::from_str(utf8)
+            }
+
+            #[inline(always)]
+            fn value(&self) -> &Self::Value {
+                self
+            }
+        }
+    };
+}
+
+#[doc(hidden)]
+#[macro_export]
+/// Generate a static [`kernel_param_ops`](../../../include/linux/moduleparam.h) struct.
+///
+/// # Examples
+///
+/// ```ignore
+/// make_param_ops!(
+///     /// Documentation for new param ops.
+///     PARAM_OPS_MYTYPE, // Name for the static.
+///     MyType // A type which implements [`ModuleParam`].
+/// );
+/// ```
+macro_rules! make_param_ops {
+    ($ops:ident, $ty:ty) => {
+        $crate::make_param_ops!(
+            #[doc=""]
+            $ops,
+            $ty
+        );
+    };
+    ($(#[$meta:meta])* $ops:ident, $ty:ty) => {
+        $(#[$meta])*
+        ///
+        /// Static [`kernel_param_ops`](../../../include/linux/moduleparam.h)
+        /// struct generated by [`make_param_ops`].
+        pub static $ops: $crate::bindings::kernel_param_ops = $crate::bindings::kernel_param_ops {
+            flags: if <$ty as $crate::module_param::ModuleParam>::NOARG_ALLOWED {
+                $crate::bindings::KERNEL_PARAM_OPS_FL_NOARG
+            } else {
+                0
+            },
+            set: Some(<$ty as $crate::module_param::ModuleParam>::set_param),
+            get: Some(<$ty as $crate::module_param::ModuleParam>::get_param),
+            free: Some(<$ty as $crate::module_param::ModuleParam>::free),
+        };
+    };
+}
+
+impl_module_param!(i8);
+impl_module_param!(u8);
+impl_module_param!(i16);
+impl_module_param!(u16);
+impl_module_param!(i32);
+impl_module_param!(u32);
+impl_module_param!(i64);
+impl_module_param!(u64);
+impl_module_param!(isize);
+impl_module_param!(usize);
+
+make_param_ops!(
+    /// Rust implementation of [`kernel_param_ops`](../../../include/linux/moduleparam.h)
+    /// for [`i8`].
+    PARAM_OPS_I8,
+    i8
+);
+make_param_ops!(
+    /// Rust implementation of [`kernel_param_ops`](../../../include/linux/moduleparam.h)
+    /// for [`u8`].
+    PARAM_OPS_U8,
+    u8
+);
+make_param_ops!(
+    /// Rust implementation of [`kernel_param_ops`](../../../include/linux/moduleparam.h)
+    /// for [`i16`].
+    PARAM_OPS_I16,
+    i16
+);
+make_param_ops!(
+    /// Rust implementation of [`kernel_param_ops`](../../../include/linux/moduleparam.h)
+    /// for [`u16`].
+    PARAM_OPS_U16,
+    u16
+);
+make_param_ops!(
+    /// Rust implementation of [`kernel_param_ops`](../../../include/linux/moduleparam.h)
+    /// for [`i32`].
+    PARAM_OPS_I32,
+    i32
+);
+make_param_ops!(
+    /// Rust implementation of [`kernel_param_ops`](../../../include/linux/moduleparam.h)
+    /// for [`u32`].
+    PARAM_OPS_U32,
+    u32
+);
+make_param_ops!(
+    /// Rust implementation of [`kernel_param_ops`](../../../include/linux/moduleparam.h)
+    /// for [`i64`].
+    PARAM_OPS_I64,
+    i64
+);
+make_param_ops!(
+    /// Rust implementation of [`kernel_param_ops`](../../../include/linux/moduleparam.h)
+    /// for [`u64`].
+    PARAM_OPS_U64,
+    u64
+);
+make_param_ops!(
+    /// Rust implementation of [`kernel_param_ops`](../../../include/linux/moduleparam.h)
+    /// for [`isize`].
+    PARAM_OPS_ISIZE,
+    isize
+);
+make_param_ops!(
+    /// Rust implementation of [`kernel_param_ops`](../../../include/linux/moduleparam.h)
+    /// for [`usize`].
+    PARAM_OPS_USIZE,
+    usize
+);
+
+impl ModuleParam for bool {
+    type Value = bool;
+
+    const NOARG_ALLOWED: bool = true;
+
+    fn try_from_param_arg(arg: Option<&'static [u8]>) -> Option<Self> {
+        match arg {
+            None => Some(true),
+            Some(b"y") | Some(b"Y") | Some(b"1") | Some(b"true") => Some(true),
+            Some(b"n") | Some(b"N") | Some(b"0") | Some(b"false") => Some(false),
+            _ => None,
+        }
+    }
+
+    #[inline(always)]
+    fn value(&self) -> &Self::Value {
+        self
+    }
+}
+
+make_param_ops!(
+    /// Rust implementation of [`kernel_param_ops`](../../../include/linux/moduleparam.h)
+    /// for [`bool`].
+    PARAM_OPS_BOOL,
+    bool
+);
+
+/// An array of at __most__ `N` values.
+///
+/// # Invariant
+///
+/// The first `self.used` elements of `self.values` are initialized.
+pub struct ArrayParam<T, const N: usize> {
+    values: [core::mem::MaybeUninit<T>; N],
+    used: usize,
+}
+
+impl<T, const N: usize> ArrayParam<T, { N }> {
+    fn values(&self) -> &[T] {
+        // SAFETY: The invariant maintained by `ArrayParam` allows us to cast
+        // the first `self.used` elements to `T`.
+        unsafe {
+            &*(&self.values[0..self.used] as *const [core::mem::MaybeUninit<T>] as *const [T])
+        }
+    }
+}
+
+impl<T: Copy, const N: usize> ArrayParam<T, { N }> {
+    const fn new() -> Self {
+        // INVARIANT: The first `self.used` elements of `self.values` are
+        // initialized.
+        ArrayParam {
+            values: [core::mem::MaybeUninit::uninit(); N],
+            used: 0,
+        }
+    }
+
+    const fn push(&mut self, val: T) {
+        if self.used < N {
+            // INVARIANT: The first `self.used` elements of `self.values` are
+            // initialized.
+            self.values[self.used] = core::mem::MaybeUninit::new(val);
+            self.used += 1;
+        }
+    }
+
+    /// Create an instance of `ArrayParam` initialized with `vals`.
+    ///
+    /// This function is only meant to be used in the [`module::module`] macro.
+    pub const fn create(vals: &[T]) -> Self {
+        let mut result = ArrayParam::new();
+        let mut i = 0;
+        while i < vals.len() {
+            result.push(vals[i]);
+            i += 1;
+        }
+        result
+    }
+}
+
+impl<T: core::fmt::Display, const N: usize> core::fmt::Display for ArrayParam<T, { N }> {
+    fn fmt(&self, f: &mut core::fmt::Formatter<'_>) -> core::fmt::Result {
+        for val in self.values() {
+            write!(f, "{},", val)?;
+        }
+        Ok(())
+    }
+}
+
+impl<T: Copy + core::fmt::Display + ModuleParam, const N: usize> ModuleParam
+    for ArrayParam<T, { N }>
+{
+    type Value = [T];
+
+    const NOARG_ALLOWED: bool = false;
+
+    fn try_from_param_arg(arg: Option<&'static [u8]>) -> Option<Self> {
+        arg.and_then(|args| {
+            let mut result = Self::new();
+            for arg in args.split(|b| *b == b',') {
+                result.push(T::try_from_param_arg(Some(arg))?);
+            }
+            Some(result)
+        })
+    }
+
+    fn value(&self) -> &Self::Value {
+        self.values()
+    }
+}
+
+/// A C-style string parameter.
+///
+/// The Rust version of the [`charp`] parameter. This type is meant to be
+/// used by the [`macros::module`] macro, not handled directly. Instead use the
+/// `read` method generated by that macro.
+///
+/// [`charp`]: ../../../include/linux/moduleparam.h
+pub enum StringParam {
+    /// A borrowed parameter value.
+    ///
+    /// Either the default value (which is static in the module) or borrowed
+    /// from the original argument buffer used to set the value.
+    Ref(&'static [u8]),
+
+    /// A value that was allocated when the parameter was set.
+    ///
+    /// The value needs to be freed when the parameter is reset or the module is
+    /// unloaded.
+    Owned(alloc::vec::Vec<u8>),
+}
+
+impl StringParam {
+    fn bytes(&self) -> &[u8] {
+        match self {
+            StringParam::Ref(bytes) => *bytes,
+            StringParam::Owned(vec) => &vec[..],
+        }
+    }
+}
+
+impl core::fmt::Display for StringParam {
+    fn fmt(&self, f: &mut core::fmt::Formatter<'_>) -> core::fmt::Result {
+        let bytes = self.bytes();
+        match core::str::from_utf8(bytes) {
+            Ok(utf8) => write!(f, "{}", utf8),
+            Err(_) => write!(f, "{:?}", bytes),
+        }
+    }
+}
+
+impl ModuleParam for StringParam {
+    type Value = [u8];
+
+    const NOARG_ALLOWED: bool = false;
+
+    fn try_from_param_arg(arg: Option<&'static [u8]>) -> Option<Self> {
+        // SAFETY: It is always safe to call [`slab_is_available`](../../../include/linux/slab.h).
+        let slab_available = unsafe { crate::bindings::slab_is_available() };
+        arg.and_then(|arg| {
+            if slab_available {
+                let mut vec = alloc::vec::Vec::new();
+                vec.try_extend_from_slice(arg).ok()?;
+                Some(StringParam::Owned(vec))
+            } else {
+                Some(StringParam::Ref(arg))
+            }
+        })
+    }
+
+    fn value(&self) -> &Self::Value {
+        self.bytes()
+    }
+}
+
+make_param_ops!(
+    /// Rust implementation of [`kernel_param_ops`](../../../include/linux/moduleparam.h)
+    /// for [`StringParam`].
+    PARAM_OPS_STR,
+    StringParam
+);
diff --git a/rust/macros/helpers.rs b/rust/macros/helpers.rs
index b2bdd4d8c958..8b5f9bc414d7 100644
--- a/rust/macros/helpers.rs
+++ b/rust/macros/helpers.rs
@@ -18,6 +18,16 @@ pub(crate) fn try_literal(it: &mut token_stream::IntoIter) -> Option<String> {
     }
 }
 
+pub(crate) fn try_byte_string(it: &mut token_stream::IntoIter) -> Option<String> {
+    try_literal(it).and_then(|byte_string| {
+        if byte_string.starts_with("b\"") && byte_string.ends_with('\"') {
+            Some(byte_string[2..byte_string.len() - 1].to_string())
+        } else {
+            None
+        }
+    })
+}
+
 pub(crate) fn try_string(it: &mut token_stream::IntoIter) -> Option<String> {
     try_literal(it).and_then(|string| {
         if string.starts_with('\"') && string.ends_with('\"') {
@@ -46,14 +56,8 @@ pub(crate) fn expect_punct(it: &mut token_stream::IntoIter) -> char {
     }
 }
 
-pub(crate) fn expect_string(it: &mut token_stream::IntoIter) -> String {
-    try_string(it).expect("Expected string")
-}
-
-pub(crate) fn expect_string_ascii(it: &mut token_stream::IntoIter) -> String {
-    let string = try_string(it).expect("Expected string");
-    assert!(string.is_ascii(), "Expected ASCII string");
-    string
+pub(crate) fn expect_literal(it: &mut token_stream::IntoIter) -> String {
+    try_literal(it).expect("Expected Literal")
 }
 
 pub(crate) fn expect_group(it: &mut token_stream::IntoIter) -> Group {
@@ -64,8 +68,34 @@ pub(crate) fn expect_group(it: &mut token_stream::IntoIter) -> Group {
     }
 }
 
+pub(crate) fn expect_string(it: &mut token_stream::IntoIter) -> String {
+    try_string(it).expect("Expected string")
+}
+
+pub(crate) fn expect_string_ascii(it: &mut token_stream::IntoIter) -> String {
+    let string = try_string(it).expect("Expected string");
+    assert!(string.is_ascii(), "Expected ASCII string");
+    string
+}
+
 pub(crate) fn expect_end(it: &mut token_stream::IntoIter) {
     if it.next().is_some() {
         panic!("Expected end");
     }
 }
+
+pub(crate) fn get_literal(it: &mut token_stream::IntoIter, expected_name: &str) -> String {
+    assert_eq!(expect_ident(it), expected_name);
+    assert_eq!(expect_punct(it), ':');
+    let literal = expect_literal(it);
+    assert_eq!(expect_punct(it), ',');
+    literal
+}
+
+pub(crate) fn get_string(it: &mut token_stream::IntoIter, expected_name: &str) -> String {
+    assert_eq!(expect_ident(it), expected_name);
+    assert_eq!(expect_punct(it), ':');
+    let string = expect_string(it);
+    assert_eq!(expect_punct(it), ',');
+    string
+}
diff --git a/rust/macros/module.rs b/rust/macros/module.rs
index fb1244f8c2e6..172631e7fb05 100644
--- a/rust/macros/module.rs
+++ b/rust/macros/module.rs
@@ -1,25 +1,40 @@
 // SPDX-License-Identifier: GPL-2.0
 
-use crate::helpers::*;
-use proc_macro::{token_stream, Delimiter, Literal, TokenStream, TokenTree};
+use proc_macro::{token_stream, Delimiter, Group, Literal, TokenStream, TokenTree};
 use std::fmt::Write;
 
-fn expect_string_array(it: &mut token_stream::IntoIter) -> Vec<String> {
-    let group = expect_group(it);
-    assert_eq!(group.delimiter(), Delimiter::Bracket);
-    let mut values = Vec::new();
-    let mut it = group.stream().into_iter();
-
-    while let Some(val) = try_string(&mut it) {
-        assert!(val.is_ascii(), "Expected ASCII string");
-        values.push(val);
-        match it.next() {
-            Some(TokenTree::Punct(punct)) => assert_eq!(punct.as_char(), ','),
-            None => break,
-            _ => panic!("Expected ',' or end of array"),
+use crate::helpers::*;
+
+#[derive(Clone, PartialEq)]
+enum ParamType {
+    Ident(String),
+    Array { vals: String, max_length: usize },
+}
+
+fn expect_array_fields(it: &mut token_stream::IntoIter) -> ParamType {
+    assert_eq!(expect_punct(it), '<');
+    let vals = expect_ident(it);
+    assert_eq!(expect_punct(it), ',');
+    let max_length_str = expect_literal(it);
+    let max_length = max_length_str
+        .parse::<usize>()
+        .expect("Expected usize length");
+    assert_eq!(expect_punct(it), '>');
+    ParamType::Array { vals, max_length }
+}
+
+fn expect_type(it: &mut token_stream::IntoIter) -> ParamType {
+    if let TokenTree::Ident(ident) = it
+        .next()
+        .expect("Reached end of token stream for param type")
+    {
+        match ident.to_string().as_ref() {
+            "ArrayParam" => expect_array_fields(it),
+            _ => ParamType::Ident(ident.to_string()),
         }
+    } else {
+        panic!("Expected Param Type")
     }
-    values
 }
 
 struct ModInfoBuilder<'a> {
@@ -87,6 +102,113 @@ impl<'a> ModInfoBuilder<'a> {
         self.emit_only_builtin(field, content);
         self.emit_only_loadable(field, content);
     }
+
+    fn emit_param(&mut self, field: &str, param: &str, content: &str) {
+        let content = format!("{param}:{content}", param = param, content = content);
+        self.emit(field, &content);
+    }
+}
+
+fn permissions_are_readonly(perms: &str) -> bool {
+    let (radix, digits) = if let Some(n) = perms.strip_prefix("0x") {
+        (16, n)
+    } else if let Some(n) = perms.strip_prefix("0o") {
+        (8, n)
+    } else if let Some(n) = perms.strip_prefix("0b") {
+        (2, n)
+    } else {
+        (10, perms)
+    };
+    match u32::from_str_radix(digits, radix) {
+        Ok(perms) => perms & 0o222 == 0,
+        Err(_) => false,
+    }
+}
+
+fn param_ops_path(param_type: &str) -> &'static str {
+    match param_type {
+        "bool" => "kernel::module_param::PARAM_OPS_BOOL",
+        "i8" => "kernel::module_param::PARAM_OPS_I8",
+        "u8" => "kernel::module_param::PARAM_OPS_U8",
+        "i16" => "kernel::module_param::PARAM_OPS_I16",
+        "u16" => "kernel::module_param::PARAM_OPS_U16",
+        "i32" => "kernel::module_param::PARAM_OPS_I32",
+        "u32" => "kernel::module_param::PARAM_OPS_U32",
+        "i64" => "kernel::module_param::PARAM_OPS_I64",
+        "u64" => "kernel::module_param::PARAM_OPS_U64",
+        "isize" => "kernel::module_param::PARAM_OPS_ISIZE",
+        "usize" => "kernel::module_param::PARAM_OPS_USIZE",
+        "str" => "kernel::module_param::PARAM_OPS_STR",
+        t => panic!("Unrecognized type {}", t),
+    }
+}
+
+#[allow(clippy::type_complexity)]
+fn try_simple_param_val(
+    param_type: &str,
+) -> Box<dyn Fn(&mut token_stream::IntoIter) -> Option<String>> {
+    match param_type {
+        "bool" => Box::new(try_ident),
+        "str" => Box::new(|param_it| {
+            try_byte_string(param_it)
+                .map(|s| format!("kernel::module_param::StringParam::Ref(b\"{}\")", s))
+        }),
+        _ => Box::new(try_literal),
+    }
+}
+
+fn get_default(param_type: &ParamType, param_it: &mut token_stream::IntoIter) -> String {
+    let try_param_val = match param_type {
+        ParamType::Ident(ref param_type)
+        | ParamType::Array {
+            vals: ref param_type,
+            max_length: _,
+        } => try_simple_param_val(param_type),
+    };
+    assert_eq!(expect_ident(param_it), "default");
+    assert_eq!(expect_punct(param_it), ':');
+    let default = match param_type {
+        ParamType::Ident(_) => try_param_val(param_it).expect("Expected default param value"),
+        ParamType::Array {
+            vals: _,
+            max_length: _,
+        } => {
+            let group = expect_group(param_it);
+            assert_eq!(group.delimiter(), Delimiter::Bracket);
+            let mut default_vals = Vec::new();
+            let mut it = group.stream().into_iter();
+
+            while let Some(default_val) = try_param_val(&mut it) {
+                default_vals.push(default_val);
+                match it.next() {
+                    Some(TokenTree::Punct(punct)) => assert_eq!(punct.as_char(), ','),
+                    None => break,
+                    _ => panic!("Expected ',' or end of array default values"),
+                }
+            }
+
+            let mut default_array = "kernel::module_param::ArrayParam::create(&[".to_string();
+            default_array.push_str(
+                &default_vals
+                    .iter()
+                    .map(|val| val.to_string())
+                    .collect::<Vec<String>>()
+                    .join(","),
+            );
+            default_array.push_str("])");
+            default_array
+        }
+    };
+    assert_eq!(expect_punct(param_it), ',');
+    default
+}
+
+fn generated_array_ops_name(vals: &str, max_length: usize) -> String {
+    format!(
+        "__generated_array_ops_{vals}_{max_length}",
+        vals = vals,
+        max_length = max_length
+    )
 }
 
 #[derive(Debug, Default)]
@@ -96,15 +218,24 @@ struct ModuleInfo {
     name: String,
     author: Option<String>,
     description: Option<String>,
-    alias: Option<Vec<String>>,
+    alias: Option<String>,
+    params: Option<Group>,
 }
 
 impl ModuleInfo {
     fn parse(it: &mut token_stream::IntoIter) -> Self {
         let mut info = ModuleInfo::default();
 
-        const EXPECTED_KEYS: &[&str] =
-            &["type", "name", "author", "description", "license", "alias"];
+        const EXPECTED_KEYS: &[&str] = &[
+            "type",
+            "name",
+            "author",
+            "description",
+            "license",
+            "alias",
+            "alias_rtnl_link",
+            "params",
+        ];
         const REQUIRED_KEYS: &[&str] = &["type", "name", "license"];
         let mut seen_keys = Vec::new();
 
@@ -130,7 +261,11 @@ impl ModuleInfo {
                 "author" => info.author = Some(expect_string(it)),
                 "description" => info.description = Some(expect_string(it)),
                 "license" => info.license = expect_string_ascii(it),
-                "alias" => info.alias = Some(expect_string_array(it)),
+                "alias" => info.alias = Some(expect_string_ascii(it)),
+                "alias_rtnl_link" => {
+                    info.alias = Some(format!("rtnl-link-{}", expect_string_ascii(it)))
+                }
+                "params" => info.params = Some(expect_group(it)),
                 _ => panic!(
                     "Unknown key \"{}\". Valid keys are: {:?}.",
                     key, EXPECTED_KEYS
@@ -181,10 +316,8 @@ pub(crate) fn module(ts: TokenStream) -> TokenStream {
         modinfo.emit("description", &description);
     }
     modinfo.emit("license", &info.license);
-    if let Some(aliases) = info.alias {
-        for alias in aliases {
-            modinfo.emit("alias", &alias);
-        }
+    if let Some(alias) = info.alias {
+        modinfo.emit("alias", &alias);
     }
 
     // Built-in modules also export the `file` modinfo string.
@@ -192,6 +325,195 @@ pub(crate) fn module(ts: TokenStream) -> TokenStream {
         std::env::var("RUST_MODFILE").expect("Unable to fetch RUST_MODFILE environmental variable");
     modinfo.emit_only_builtin("file", &file);
 
+    let mut array_types_to_generate = Vec::new();
+    if let Some(params) = info.params {
+        assert_eq!(params.delimiter(), Delimiter::Brace);
+
+        let mut it = params.stream().into_iter();
+
+        loop {
+            let param_name = match it.next() {
+                Some(TokenTree::Ident(ident)) => ident.to_string(),
+                Some(_) => panic!("Expected Ident or end"),
+                None => break,
+            };
+
+            assert_eq!(expect_punct(&mut it), ':');
+            let param_type = expect_type(&mut it);
+            let group = expect_group(&mut it);
+            assert_eq!(expect_punct(&mut it), ',');
+
+            assert_eq!(group.delimiter(), Delimiter::Brace);
+
+            let mut param_it = group.stream().into_iter();
+            let param_default = get_default(&param_type, &mut param_it);
+            let param_permissions = get_literal(&mut param_it, "permissions");
+            let param_description = get_string(&mut param_it, "description");
+            expect_end(&mut param_it);
+
+            // TODO: More primitive types.
+            // TODO: Other kinds: unsafes, etc.
+            let (param_kernel_type, ops): (String, _) = match param_type {
+                ParamType::Ident(ref param_type) => (
+                    param_type.to_string(),
+                    param_ops_path(param_type).to_string(),
+                ),
+                ParamType::Array {
+                    ref vals,
+                    max_length,
+                } => {
+                    array_types_to_generate.push((vals.clone(), max_length));
+                    (
+                        format!("__rust_array_param_{}_{}", vals, max_length),
+                        generated_array_ops_name(vals, max_length),
+                    )
+                }
+            };
+
+            modinfo.emit_param("parmtype", &param_name, &param_kernel_type);
+            modinfo.emit_param("parm", &param_name, &param_description);
+            let param_type_internal = match param_type {
+                ParamType::Ident(ref param_type) => match param_type.as_ref() {
+                    "str" => "kernel::module_param::StringParam".to_string(),
+                    other => other.to_string(),
+                },
+                ParamType::Array {
+                    ref vals,
+                    max_length,
+                } => format!(
+                    "kernel::module_param::ArrayParam<{vals}, {max_length}>",
+                    vals = vals,
+                    max_length = max_length
+                ),
+            };
+            let read_func = if permissions_are_readonly(&param_permissions) {
+                format!(
+                    "
+                        fn read(&self)
+                            -> &<{param_type_internal} as kernel::module_param::ModuleParam>::Value {{
+                            // SAFETY: Parameters do not need to be locked because they are
+                            // read only or sysfs is not enabled.
+                            unsafe {{
+                                <{param_type_internal} as kernel::module_param::ModuleParam>::value(
+                                    &__{name}_{param_name}_value
+                                )
+                            }}
+                        }}
+                    ",
+                    name = info.name,
+                    param_name = param_name,
+                    param_type_internal = param_type_internal,
+                )
+            } else {
+                format!(
+                    "
+                        fn read<'lck>(&self, lock: &'lck kernel::KParamGuard)
+                            -> &'lck <{param_type_internal} as kernel::module_param::ModuleParam>::Value {{
+                            // SAFETY: Parameters are locked by `KParamGuard`.
+                            unsafe {{
+                                <{param_type_internal} as kernel::module_param::ModuleParam>::value(
+                                    &__{name}_{param_name}_value
+                                )
+                            }}
+                        }}
+                    ",
+                    name = info.name,
+                    param_name = param_name,
+                    param_type_internal = param_type_internal,
+                )
+            };
+            let kparam = format!(
+                "
+                    kernel::bindings::kernel_param__bindgen_ty_1 {{
+                        arg: unsafe {{ &__{name}_{param_name}_value }}
+                            as *const _ as *mut core::ffi::c_void,
+                    }},
+                ",
+                name = info.name,
+                param_name = param_name,
+            );
+            write!(
+                modinfo.buffer,
+                "
+                static mut __{name}_{param_name}_value: {param_type_internal} = {param_default};
+
+                struct __{name}_{param_name};
+
+                impl __{name}_{param_name} {{ {read_func} }}
+
+                const {param_name}: __{name}_{param_name} = __{name}_{param_name};
+
+                // Note: the C macro that generates the static structs for the `__param` section
+                // asks for them to be `aligned(sizeof(void *))`. However, that was put in place
+                // in 2003 in commit 38d5b085d2a0 (\"[PATCH] Fix over-alignment problem on x86-64\")
+                // to undo GCC over-alignment of static structs of >32 bytes. It seems that is
+                // not the case anymore, so we simplify to a transparent representation here
+                // in the expectation that it is not needed anymore.
+                // TODO: Revisit this to confirm the above comment and remove it if it happened.
+                #[repr(transparent)]
+                struct __{name}_{param_name}_RacyKernelParam(kernel::bindings::kernel_param);
+
+                unsafe impl Sync for __{name}_{param_name}_RacyKernelParam {{
+                }}
+
+                #[cfg(not(MODULE))]
+                const __{name}_{param_name}_name: *const core::ffi::c_char =
+                    b\"{name}.{param_name}\\0\" as *const _ as *const core::ffi::c_char;
+
+                #[cfg(MODULE)]
+                const __{name}_{param_name}_name: *const core::ffi::c_char =
+                    b\"{param_name}\\0\" as *const _ as *const core::ffi::c_char;
+
+                #[link_section = \"__param\"]
+                #[used]
+                static __{name}_{param_name}_struct: __{name}_{param_name}_RacyKernelParam =
+                    __{name}_{param_name}_RacyKernelParam(kernel::bindings::kernel_param {{
+                        name: __{name}_{param_name}_name,
+                        // SAFETY: `__this_module` is constructed by the kernel at load time
+                        // and will not be freed until the module is unloaded.
+                        #[cfg(MODULE)]
+                        mod_: unsafe {{ &kernel::bindings::__this_module as *const _ as *mut _ }},
+                        #[cfg(not(MODULE))]
+                        mod_: core::ptr::null_mut(),
+                        ops: unsafe {{ &{ops} }} as *const kernel::bindings::kernel_param_ops,
+                        perm: {permissions},
+                        level: -1,
+                        flags: 0,
+                        __bindgen_anon_1: {kparam}
+                    }});
+                ",
+                name = info.name,
+                param_type_internal = param_type_internal,
+                read_func = read_func,
+                param_default = param_default,
+                param_name = param_name,
+                ops = ops,
+                permissions = param_permissions,
+                kparam = kparam,
+            )
+            .unwrap();
+        }
+    }
+
+    let mut generated_array_types = String::new();
+
+    for (vals, max_length) in array_types_to_generate {
+        let ops_name = generated_array_ops_name(&vals, max_length);
+        write!(
+            generated_array_types,
+            "
+                kernel::make_param_ops!(
+                    {ops_name},
+                    kernel::module_param::ArrayParam<{vals}, {{ {max_length} }}>
+                );
+            ",
+            ops_name = ops_name,
+            vals = vals,
+            max_length = max_length,
+        )
+        .unwrap();
+    }
+
     format!(
         "
             /// The module name.
@@ -291,12 +613,44 @@ pub(crate) fn module(ts: TokenStream) -> TokenStream {
             }}
 
             {modinfo}
+
+            {generated_array_types}
         ",
         type_ = info.type_,
         name = info.name,
         modinfo = modinfo.buffer,
+        generated_array_types = generated_array_types,
         initcall_section = ".initcall6.init"
     )
     .parse()
     .expect("Error parsing formatted string into token stream.")
 }
+
+#[cfg(test)]
+mod tests {
+    use super::*;
+
+    #[test]
+    fn test_permissions_are_readonly() {
+        assert!(permissions_are_readonly("0b000000000"));
+        assert!(permissions_are_readonly("0o000"));
+        assert!(permissions_are_readonly("000"));
+        assert!(permissions_are_readonly("0x000"));
+
+        assert!(!permissions_are_readonly("0b111111111"));
+        assert!(!permissions_are_readonly("0o777"));
+        assert!(!permissions_are_readonly("511"));
+        assert!(!permissions_are_readonly("0x1ff"));
+
+        assert!(permissions_are_readonly("0o014"));
+        assert!(permissions_are_readonly("0o015"));
+
+        assert!(!permissions_are_readonly("0o214"));
+        assert!(!permissions_are_readonly("0o024"));
+        assert!(!permissions_are_readonly("0o012"));
+
+        assert!(!permissions_are_readonly("0o315"));
+        assert!(!permissions_are_readonly("0o065"));
+        assert!(!permissions_are_readonly("0o017"));
+    }
+}
-- 
2.40.0


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [RFC PATCH 06/11] rust: apply cache line padding for `SpinLock`
  2023-05-03  9:06 [RFC PATCH 00/11] Rust null block driver Andreas Hindborg
                   ` (4 preceding siblings ...)
  2023-05-03  9:07 ` [RFC PATCH 05/11] RUST: add `module_params` macro Andreas Hindborg
@ 2023-05-03  9:07 ` Andreas Hindborg
  2023-05-03 12:03   ` Alice Ryhl
  2023-05-03  9:07 ` [RFC PATCH 07/11] rust: lock: add support for `Lock::lock_irqsave` Andreas Hindborg
                   ` (10 subsequent siblings)
  16 siblings, 1 reply; 69+ messages in thread
From: Andreas Hindborg @ 2023-05-03  9:07 UTC (permalink / raw)
  To: Jens Axboe, Christoph Hellwig, Keith Busch, Damien Le Moal,
	Hannes Reinecke, lsf-pc, rust-for-linux, linux-block
  Cc: Andreas Hindborg, Matthew Wilcox, Miguel Ojeda, Alex Gaynor,
	Wedson Almeida Filho, Boqun Feng, Gary Guo, Björn Roy Baron,
	Benno Lossin, Andreas Hindborg, open list, gost.dev

From: Andreas Hindborg <a.hindborg@samsung.com>

The kernel `struct spinlock` is 4 bytes on x86 when lockdep is not enabled. The
structure is not padded to fit a cache line. The effect of this for `SpinLock`
is that the lock variable and the value protected by the lock will share a cache
line, depending on the alignment requirements of the protected value. Aligning
the lock variable and the protected value to a cache line yields a 20%
performance increase for the Rust null block driver for sequential reads to
memory backed devices at 6 concurrent readers.

Signed-off-by: Andreas Hindborg <a.hindborg@samsung.com>
---
 rust/kernel/cache_padded.rs       | 33 +++++++++++++++++++++++++++++++
 rust/kernel/lib.rs                |  2 ++
 rust/kernel/sync/lock.rs          |  9 ++++++---
 rust/kernel/sync/lock/spinlock.rs | 13 ++++++++----
 4 files changed, 50 insertions(+), 7 deletions(-)
 create mode 100644 rust/kernel/cache_padded.rs

diff --git a/rust/kernel/cache_padded.rs b/rust/kernel/cache_padded.rs
new file mode 100644
index 000000000000..758678e71f50
--- /dev/null
+++ b/rust/kernel/cache_padded.rs
@@ -0,0 +1,33 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#[repr(align(64))]
+pub struct CachePadded<T: ?Sized> {
+    value: T,
+}
+
+unsafe impl<T: Send> Send for CachePadded<T> {}
+unsafe impl<T: Sync> Sync for CachePadded<T> {}
+
+impl<T> CachePadded<T> {
+    /// Pads and aligns a value to 64 bytes.
+    #[inline(always)]
+    pub(crate) const fn new(t: T) -> CachePadded<T> {
+        CachePadded::<T> { value: t }
+    }
+}
+
+impl<T: ?Sized> core::ops::Deref for CachePadded<T> {
+    type Target = T;
+
+    #[inline(always)]
+    fn deref(&self) -> &T {
+        &self.value
+    }
+}
+
+impl<T: ?Sized> core::ops::DerefMut for CachePadded<T> {
+    #[inline(always)]
+    fn deref_mut(&mut self) -> &mut T {
+        &mut self.value
+    }
+}
diff --git a/rust/kernel/lib.rs b/rust/kernel/lib.rs
index a0bd0b0e2aef..426e2dea0da6 100644
--- a/rust/kernel/lib.rs
+++ b/rust/kernel/lib.rs
@@ -37,6 +37,7 @@ extern crate self as kernel;
 mod allocator;
 pub mod block;
 mod build_assert;
+mod cache_padded;
 pub mod error;
 pub mod init;
 pub mod ioctl;
@@ -56,6 +57,7 @@ pub mod types;
 
 #[doc(hidden)]
 pub use bindings;
+pub(crate) use cache_padded::CachePadded;
 pub use macros;
 pub use uapi;
 
diff --git a/rust/kernel/sync/lock.rs b/rust/kernel/sync/lock.rs
index a2216325632d..1c584b1df30d 100644
--- a/rust/kernel/sync/lock.rs
+++ b/rust/kernel/sync/lock.rs
@@ -6,7 +6,9 @@
 //! spinlocks, raw spinlocks) to be provided with minimal effort.
 
 use super::LockClassKey;
-use crate::{bindings, init::PinInit, pin_init, str::CStr, types::Opaque, types::ScopeGuard};
+use crate::{
+    bindings, init::PinInit, pin_init, str::CStr, types::Opaque, types::ScopeGuard, CachePadded,
+};
 use core::{cell::UnsafeCell, marker::PhantomData, marker::PhantomPinned};
 use macros::pin_data;
 
@@ -87,7 +89,7 @@ pub struct Lock<T: ?Sized, B: Backend> {
     _pin: PhantomPinned,
 
     /// The data protected by the lock.
-    pub(crate) data: UnsafeCell<T>,
+    pub(crate) data: CachePadded<UnsafeCell<T>>,
 }
 
 // SAFETY: `Lock` can be transferred across thread boundaries iff the data it protects can.
@@ -102,7 +104,7 @@ impl<T, B: Backend> Lock<T, B> {
     #[allow(clippy::new_ret_no_self)]
     pub fn new(t: T, name: &'static CStr, key: &'static LockClassKey) -> impl PinInit<Self> {
         pin_init!(Self {
-            data: UnsafeCell::new(t),
+            data: CachePadded::new(UnsafeCell::new(t)),
             _pin: PhantomPinned,
             // SAFETY: `slot` is valid while the closure is called and both `name` and `key` have
             // static lifetimes so they live indefinitely.
@@ -115,6 +117,7 @@ impl<T, B: Backend> Lock<T, B> {
 
 impl<T: ?Sized, B: Backend> Lock<T, B> {
     /// Acquires the lock and gives the caller access to the data protected by it.
+    #[inline(always)]
     pub fn lock(&self) -> Guard<'_, T, B> {
         // SAFETY: The constructor of the type calls `init`, so the existence of the object proves
         // that `init` was called.
diff --git a/rust/kernel/sync/lock/spinlock.rs b/rust/kernel/sync/lock/spinlock.rs
index 979b56464a4e..e39142a8148c 100644
--- a/rust/kernel/sync/lock/spinlock.rs
+++ b/rust/kernel/sync/lock/spinlock.rs
@@ -4,7 +4,10 @@
 //!
 //! This module allows Rust code to use the kernel's `spinlock_t`.
 
+use core::ops::DerefMut;
+
 use crate::bindings;
+use crate::CachePadded;
 
 /// Creates a [`SpinLock`] initialiser with the given name and a newly-created lock class.
 ///
@@ -90,7 +93,7 @@ pub struct SpinLockBackend;
 // SAFETY: The underlying kernel `spinlock_t` object ensures mutual exclusion. `relock` uses the
 // default implementation that always calls the same locking method.
 unsafe impl super::Backend for SpinLockBackend {
-    type State = bindings::spinlock_t;
+    type State = CachePadded<bindings::spinlock_t>;
     type GuardState = ();
 
     unsafe fn init(
@@ -100,18 +103,20 @@ unsafe impl super::Backend for SpinLockBackend {
     ) {
         // SAFETY: The safety requirements ensure that `ptr` is valid for writes, and `name` and
         // `key` are valid for read indefinitely.
-        unsafe { bindings::__spin_lock_init(ptr, name, key) }
+        unsafe { bindings::__spin_lock_init((&mut *ptr).deref_mut(), name, key) }
     }
 
+    #[inline(always)]
     unsafe fn lock(ptr: *mut Self::State) -> Self::GuardState {
         // SAFETY: The safety requirements of this function ensure that `ptr` points to valid
         // memory, and that it has been initialised before.
-        unsafe { bindings::spin_lock(ptr) }
+        unsafe { bindings::spin_lock((&mut *ptr).deref_mut()) }
     }
 
+    #[inline(always)]
     unsafe fn unlock(ptr: *mut Self::State, _guard_state: &Self::GuardState) {
         // SAFETY: The safety requirements of this function ensure that `ptr` is valid and that the
         // caller is the owner of the mutex.
-        unsafe { bindings::spin_unlock(ptr) }
+        unsafe { bindings::spin_unlock((&mut *ptr).deref_mut()) }
     }
 }
-- 
2.40.0


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [RFC PATCH 07/11] rust: lock: add support for `Lock::lock_irqsave`
  2023-05-03  9:06 [RFC PATCH 00/11] Rust null block driver Andreas Hindborg
                   ` (5 preceding siblings ...)
  2023-05-03  9:07 ` [RFC PATCH 06/11] rust: apply cache line padding for `SpinLock` Andreas Hindborg
@ 2023-05-03  9:07 ` Andreas Hindborg
  2023-05-03  9:07 ` [RFC PATCH 08/11] rust: lock: implement `IrqSaveBackend` for `SpinLock` Andreas Hindborg
                   ` (9 subsequent siblings)
  16 siblings, 0 replies; 69+ messages in thread
From: Andreas Hindborg @ 2023-05-03  9:07 UTC (permalink / raw)
  To: Jens Axboe, Christoph Hellwig, Keith Busch, Damien Le Moal,
	Hannes Reinecke, lsf-pc, rust-for-linux, linux-block
  Cc: Wedson Almeida Filho, Matthew Wilcox, Miguel Ojeda, Alex Gaynor,
	Wedson Almeida Filho, Boqun Feng, Gary Guo, Björn Roy Baron,
	Benno Lossin, Andreas Hindborg, Andreas Hindborg, open list,
	gost.dev

From: Wedson Almeida Filho <walmeida@microsoft.com>

This allows locks like spinlocks and raw spinlocks to expose a
`lock_irqsave` variant in Rust that corresponds to the C version.

Reviewed-by: Martin Rodriguez Reboredo <yakoyoku@gmail.com>
Signed-off-by: Wedson Almeida Filho <walmeida@microsoft.com>
---
 rust/kernel/sync/lock.rs | 38 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 38 insertions(+)

diff --git a/rust/kernel/sync/lock.rs b/rust/kernel/sync/lock.rs
index 1c584b1df30d..bb21af8a8377 100644
--- a/rust/kernel/sync/lock.rs
+++ b/rust/kernel/sync/lock.rs
@@ -72,6 +72,44 @@ pub unsafe trait Backend {
     }
 }
 
+/// The "backend" of a lock that supports the irq-save variant.
+///
+/// # Safety
+///
+/// The same requirements wrt mutual exclusion in [`Backend`] apply for acquiring the lock via
+/// [`IrqSaveBackend::lock_irqsave`].
+///
+/// Additionally, when [`IrqSaveBackend::lock_irqsave`] is used to acquire the lock, implementers
+/// must disable interrupts on lock, and restore interrupt state on unlock. Implementers may use
+/// [`Backend::GuardState`] to store state needed to keep track of the interrupt state.
+pub unsafe trait IrqSaveBackend: Backend {
+    /// Acquires the lock, making the caller its owner.
+    ///
+    /// Before acquiring the lock, it disables interrupts, and returns the previous interrupt state
+    /// as its guard state so that the guard can restore it when it is dropped.
+    ///
+    /// # Safety
+    ///
+    /// Callers must ensure that [`Backend::init`] has been previously called.
+    #[must_use]
+    unsafe fn lock_irqsave(ptr: *mut Self::State) -> Self::GuardState;
+}
+
+impl<T: ?Sized, B: IrqSaveBackend> Lock<T, B> {
+    /// Acquires the lock and gives the caller access to the data protected by it.
+    ///
+    /// Before acquiring the lock, it disables interrupts. When the guard is dropped, the interrupt
+    /// state (either enabled or disabled) is restored to its state before
+    /// [`lock_irqsave`](Self::lock_irqsave) was called.
+    pub fn lock_irqsave(&self) -> Guard<'_, T, B> {
+        // SAFETY: The constructor of the type calls `init`, so the existence of the object proves
+        // that `init` was called.
+        let state = unsafe { B::lock_irqsave(self.state.get()) };
+        // SAFETY: The lock was just acquired.
+        unsafe { Guard::new(self, state) }
+    }
+}
+
 /// A mutual exclusion primitive.
 ///
 /// Exposes one of the kernel locking primitives. Which one is exposed depends on the lock backend
-- 
2.40.0


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [RFC PATCH 08/11] rust: lock: implement `IrqSaveBackend` for `SpinLock`
  2023-05-03  9:06 [RFC PATCH 00/11] Rust null block driver Andreas Hindborg
                   ` (6 preceding siblings ...)
  2023-05-03  9:07 ` [RFC PATCH 07/11] rust: lock: add support for `Lock::lock_irqsave` Andreas Hindborg
@ 2023-05-03  9:07 ` Andreas Hindborg
  2023-05-03  9:07 ` [RFC PATCH 09/11] RUST: implement `ForeignOwnable` for `Pin` Andreas Hindborg
                   ` (8 subsequent siblings)
  16 siblings, 0 replies; 69+ messages in thread
From: Andreas Hindborg @ 2023-05-03  9:07 UTC (permalink / raw)
  To: Jens Axboe, Christoph Hellwig, Keith Busch, Damien Le Moal,
	Hannes Reinecke, lsf-pc, rust-for-linux, linux-block
  Cc: Wedson Almeida Filho, Matthew Wilcox, Miguel Ojeda, Alex Gaynor,
	Wedson Almeida Filho, Boqun Feng, Gary Guo, Björn Roy Baron,
	Benno Lossin, Andreas Hindborg, Andreas Hindborg, open list,
	gost.dev

From: Wedson Almeida Filho <walmeida@microsoft.com>

This allows Rust code to use the `lock_irqsave` variant of spinlocks.

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Reviewed-by: Martin Rodriguez Reboredo <yakoyoku@gmail.com>
Signed-off-by: Wedson Almeida Filho <walmeida@microsoft.com>
---
 rust/helpers.c                    | 16 ++++++++++++
 rust/kernel/sync/lock/spinlock.rs | 41 +++++++++++++++++++++++++------
 2 files changed, 50 insertions(+), 7 deletions(-)

diff --git a/rust/helpers.c b/rust/helpers.c
index a59341084774..7f5b95f652e1 100644
--- a/rust/helpers.c
+++ b/rust/helpers.c
@@ -203,6 +203,22 @@ struct page *rust_helper_alloc_pages(gfp_t gfp_mask, unsigned int order)
 }
 EXPORT_SYMBOL_GPL(rust_helper_alloc_pages);
 
+unsigned long rust_helper_spin_lock_irqsave(spinlock_t *lock)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(lock, flags);
+
+	return flags;
+}
+EXPORT_SYMBOL_GPL(rust_helper_spin_lock_irqsave);
+
+void rust_helper_spin_unlock_irqrestore(spinlock_t *lock, unsigned long flags)
+{
+	spin_unlock_irqrestore(lock, flags);
+}
+EXPORT_SYMBOL_GPL(rust_helper_spin_unlock_irqrestore);
+
 /*
  * We use `bindgen`'s `--size_t-is-usize` option to bind the C `size_t` type
  * as the Rust `usize` type, so we can use it in contexts where Rust
diff --git a/rust/kernel/sync/lock/spinlock.rs b/rust/kernel/sync/lock/spinlock.rs
index e39142a8148c..50b8775bb49d 100644
--- a/rust/kernel/sync/lock/spinlock.rs
+++ b/rust/kernel/sync/lock/spinlock.rs
@@ -64,6 +64,8 @@ macro_rules! new_spinlock {
 /// assert_eq!(e.c, 10);
 /// assert_eq!(e.d.lock().a, 20);
 /// assert_eq!(e.d.lock().b, 30);
+/// assert_eq!(e.d.lock_irqsave().a, 20);
+/// assert_eq!(e.d.lock_irqsave().b, 30);
 /// ```
 ///
 /// The following example shows how to use interior mutability to modify the contents of a struct
@@ -81,6 +83,12 @@ macro_rules! new_spinlock {
 ///     let mut guard = m.lock();
 ///     guard.a += 10;
 ///     guard.b += 20;
+///
+/// fn example2(m: &SpinLock<Example>) {
+///     let mut guard = m.lock_irqsave();
+///     guard.a += 10;
+///     guard.b += 20;
+/// }
 /// }
 /// ```
 ///
@@ -94,7 +102,7 @@ pub struct SpinLockBackend;
 // default implementation that always calls the same locking method.
 unsafe impl super::Backend for SpinLockBackend {
     type State = CachePadded<bindings::spinlock_t>;
-    type GuardState = ();
+    type GuardState = Option<core::ffi::c_ulong>;
 
     unsafe fn init(
         ptr: *mut Self::State,
@@ -110,13 +118,32 @@ unsafe impl super::Backend for SpinLockBackend {
     unsafe fn lock(ptr: *mut Self::State) -> Self::GuardState {
         // SAFETY: The safety requirements of this function ensure that `ptr` points to valid
         // memory, and that it has been initialised before.
-        unsafe { bindings::spin_lock((&mut *ptr).deref_mut()) }
+        unsafe { bindings::spin_lock((&mut *ptr).deref_mut()) };
+        None
     }
 
-    #[inline(always)]
-    unsafe fn unlock(ptr: *mut Self::State, _guard_state: &Self::GuardState) {
-        // SAFETY: The safety requirements of this function ensure that `ptr` is valid and that the
-        // caller is the owner of the mutex.
-        unsafe { bindings::spin_unlock((&mut *ptr).deref_mut()) }
+    unsafe fn unlock(ptr: *mut Self::State, guard_state: &Self::GuardState) {
+        match guard_state {
+            // SAFETY: The safety requirements of this function ensure that `ptr` is valid and that
+            // the caller is the owner of the mutex.
+            Some(flags) => unsafe {
+                bindings::spin_unlock_irqrestore((&mut *ptr).deref_mut(), *flags)
+            },
+            // SAFETY: The safety requirements of this function ensure that `ptr` is valid and that
+            // the caller is the owner of the mutex.
+            None => unsafe { bindings::spin_unlock((&mut *ptr).deref_mut()) },
+        }
+    }
+}
+
+// SAFETY: The underlying kernel `spinlock_t` object ensures mutual exclusion. We use the `irqsave`
+// variant of the C lock acquisition functions to disable interrupts and retrieve the original
+// interrupt state, and the `irqrestore` variant of the lock release functions to restore the state
+// in `unlock` -- we use the guard context to determine which method was used to acquire the lock.
+unsafe impl super::IrqSaveBackend for SpinLockBackend {
+    unsafe fn lock_irqsave(ptr: *mut Self::State) -> Self::GuardState {
+        // SAFETY: The safety requirements of this function ensure that `ptr` points to valid
+        // memory, and that it has been initialised before.
+        Some(unsafe { bindings::spin_lock_irqsave((&mut *ptr).deref_mut()) })
     }
 }
-- 
2.40.0


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [RFC PATCH 09/11] RUST: implement `ForeignOwnable` for `Pin`
  2023-05-03  9:06 [RFC PATCH 00/11] Rust null block driver Andreas Hindborg
                   ` (7 preceding siblings ...)
  2023-05-03  9:07 ` [RFC PATCH 08/11] rust: lock: implement `IrqSaveBackend` for `SpinLock` Andreas Hindborg
@ 2023-05-03  9:07 ` Andreas Hindborg
  2023-05-03  9:07 ` [RFC PATCH 10/11] rust: add null block driver Andreas Hindborg
                   ` (7 subsequent siblings)
  16 siblings, 0 replies; 69+ messages in thread
From: Andreas Hindborg @ 2023-05-03  9:07 UTC (permalink / raw)
  To: Jens Axboe, Christoph Hellwig, Keith Busch, Damien Le Moal,
	Hannes Reinecke, lsf-pc, rust-for-linux, linux-block
  Cc: Andreas Hindborg, Matthew Wilcox, Miguel Ojeda, Alex Gaynor,
	Wedson Almeida Filho, Boqun Feng, Gary Guo, Björn Roy Baron,
	Benno Lossin, Andreas Hindborg, open list, gost.dev

From: Andreas Hindborg <a.hindborg@samsung.com>

Implement `ForeignOwnable for Pin<T> where T: ForeignOwnable + Deref`.

Imported from rust tree [1]

[1] https://github.com/Rust-for-Linux/linux/tree/bc22545f38d74473cfef3e9fd65432733435b79f

Cc: Wedson Almeida Filho <wedsonaf@gmail.com>
---
 rust/kernel/types.rs | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/rust/kernel/types.rs b/rust/kernel/types.rs
index 29db59d6119a..98e71e96a7fc 100644
--- a/rust/kernel/types.rs
+++ b/rust/kernel/types.rs
@@ -9,6 +9,7 @@ use core::{
     marker::PhantomData,
     mem::MaybeUninit,
     ops::{Deref, DerefMut},
+    pin::Pin,
     ptr::NonNull,
 };
 
@@ -100,6 +101,29 @@ impl ForeignOwnable for () {
     unsafe fn from_foreign(_: *const core::ffi::c_void) -> Self {}
 }
 
+impl<T: ForeignOwnable + Deref> ForeignOwnable for Pin<T> {
+    type Borrowed<'a> = T::Borrowed<'a>;
+
+    fn into_foreign(self) -> *const core::ffi::c_void {
+        // SAFETY: We continue to treat the pointer as pinned by returning just a pointer to it to
+        // the caller.
+        let inner = unsafe { Pin::into_inner_unchecked(self) };
+        inner.into_foreign()
+    }
+
+    unsafe fn borrow<'a>(ptr: *const core::ffi::c_void) -> Self::Borrowed<'a> {
+        // SAFETY: The safety requirements for this function are the same as the ones for
+        // `T::borrow`.
+        unsafe { T::borrow(ptr) }
+    }
+
+    unsafe fn from_foreign(p: *const core::ffi::c_void) -> Self {
+        // SAFETY: The object was originally pinned.
+        // The passed pointer comes from a previous call to `T::into_foreign`.
+        unsafe { Pin::new_unchecked(T::from_foreign(p)) }
+    }
+}
+
 /// Runs a cleanup function/closure when dropped.
 ///
 /// The [`ScopeGuard::dismiss`] function prevents the cleanup function from running.
-- 
2.40.0


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [RFC PATCH 10/11] rust: add null block driver
  2023-05-03  9:06 [RFC PATCH 00/11] Rust null block driver Andreas Hindborg
                   ` (8 preceding siblings ...)
  2023-05-03  9:07 ` [RFC PATCH 09/11] RUST: implement `ForeignOwnable` for `Pin` Andreas Hindborg
@ 2023-05-03  9:07 ` Andreas Hindborg
  2023-05-03  9:07 ` [RFC PATCH 11/11] rust: inline a number of short functions Andreas Hindborg
                   ` (6 subsequent siblings)
  16 siblings, 0 replies; 69+ messages in thread
From: Andreas Hindborg @ 2023-05-03  9:07 UTC (permalink / raw)
  To: Jens Axboe, Christoph Hellwig, Keith Busch, Damien Le Moal,
	Hannes Reinecke, lsf-pc, rust-for-linux, linux-block
  Cc: Andreas Hindborg, Matthew Wilcox, Miguel Ojeda, Alex Gaynor,
	Wedson Almeida Filho, Boqun Feng, Gary Guo, Björn Roy Baron,
	Benno Lossin, Andreas Hindborg, open list, gost.dev

From: Andreas Hindborg <a.hindborg@samsung.com>

The driver is implemented entirely using safe Rust. This initial version
supports direct completion, read and write requests, blk-mq and optional memory
backing.

Signed-off-by: Andreas Hindborg <a.hindborg@samsung.com>
---
 drivers/block/Kconfig         |   4 +
 drivers/block/Makefile        |   4 +
 drivers/block/rnull-helpers.c |  60 ++++++++++++
 drivers/block/rnull.rs        | 177 ++++++++++++++++++++++++++++++++++
 scripts/Makefile.build        |   2 +-
 5 files changed, 246 insertions(+), 1 deletion(-)
 create mode 100644 drivers/block/rnull-helpers.c
 create mode 100644 drivers/block/rnull.rs

diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
index f79f20430ef7..644ef1bc7574 100644
--- a/drivers/block/Kconfig
+++ b/drivers/block/Kconfig
@@ -354,6 +354,10 @@ config VIRTIO_BLK
 	  This is the virtual block driver for virtio.  It can be used with
           QEMU based VMMs (like KVM or Xen).  Say Y or M.
 
+config BLK_DEV_RS_NULL
+	tristate "Rust null block driver"
+	depends on RUST
+
 config BLK_DEV_RBD
 	tristate "Rados block device (RBD)"
 	depends on INET && BLOCK
diff --git a/drivers/block/Makefile b/drivers/block/Makefile
index 101612cba303..cebbeece4bc2 100644
--- a/drivers/block/Makefile
+++ b/drivers/block/Makefile
@@ -9,6 +9,10 @@
 # needed for trace events
 ccflags-y				+= -I$(src)
 
+obj-$(CONFIG_BLK_DEV_RS_NULL) += rnull_mod.o
+rnull_mod-y := rnull-helpers.o rnull.o
+LLVM_LINK_FIX_drivers/block/rnull_mod.o := 1
+
 obj-$(CONFIG_MAC_FLOPPY)	+= swim3.o
 obj-$(CONFIG_BLK_DEV_SWIM)	+= swim_mod.o
 obj-$(CONFIG_BLK_DEV_FD)	+= floppy.o
diff --git a/drivers/block/rnull-helpers.c b/drivers/block/rnull-helpers.c
new file mode 100644
index 000000000000..826f2773ed93
--- /dev/null
+++ b/drivers/block/rnull-helpers.c
@@ -0,0 +1,60 @@
+
+#include <linux/bio.h>
+
+__attribute__((always_inline))
+void rust_helper_bio_advance_iter_single(const struct bio *bio,
+                                           struct bvec_iter *iter, unsigned int bytes)
+{
+	bio_advance_iter_single(bio, iter, bytes);
+}
+
+__attribute__((always_inline)) void *rust_helper_kmap(struct page *page)
+{
+	return kmap(page);
+}
+
+__attribute__((always_inline)) void rust_helper_kunmap(struct page *page)
+{
+	return kunmap(page);
+}
+
+__attribute__((always_inline)) void *rust_helper_kmap_atomic(struct page *page)
+{
+	return kmap_atomic(page);
+}
+
+__attribute__((always_inline)) void rust_helper_kunmap_atomic(void* address)
+{
+	kunmap_atomic(address);
+}
+
+__attribute__((always_inline)) struct page *
+rust_helper_alloc_pages(gfp_t gfp_mask, unsigned int order)
+{
+	return alloc_pages(gfp_mask, order);
+}
+
+__attribute__((always_inline)) void rust_helper_spin_lock_irq(spinlock_t *lock)
+{
+	spin_lock_irq(lock);
+}
+
+__attribute__((always_inline)) void
+rust_helper_spin_unlock_irq(spinlock_t *lock)
+{
+	spin_unlock_irq(lock);
+}
+__attribute__((always_inline)) unsigned long
+rust_helper_spin_lock_irqsave(spinlock_t *lock)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(lock, flags);
+
+	return flags;
+}
+__attribute__((always_inline)) void
+rust_helper_spin_unlock_irqrestore(spinlock_t *lock, unsigned long flags)
+{
+	spin_unlock_irqrestore(lock, flags);
+}
diff --git a/drivers/block/rnull.rs b/drivers/block/rnull.rs
new file mode 100644
index 000000000000..d95025664a60
--- /dev/null
+++ b/drivers/block/rnull.rs
@@ -0,0 +1,177 @@
+// SPDX-License-Identifier: GPL-2.0
+
+//! This is a null block driver. It currently supports optional memory backing,
+//! blk-mq interface and direct completion. The driver is configured at module
+//! load time by parameters `memory_backed` and `capacity_mib`.
+
+use kernel::{
+    bindings,
+    block::{
+        bio::Segment,
+        mq::{self, GenDisk, Operations, TagSet},
+    },
+    error::Result,
+    macros::vtable,
+    new_mutex, new_spinlock,
+    pages::Pages,
+    pr_info,
+    prelude::*,
+    radix_tree::RadixTree,
+    sync::{Arc, Mutex, SpinLock},
+    types::ForeignOwnable,
+};
+
+module! {
+    type: NullBlkModule,
+    name: "rs_null_blk",
+    author: "Andreas Hindborg",
+    license: "GPL v2",
+    params: {
+        memory_backed: bool {
+            default: true,
+            permissions: 0,
+            description: "Use memory backing",
+        },
+        capacity_mib: u64 {
+            default: 4096,
+            permissions: 0,
+            description: "Device capacity in MiB",
+        },
+    },
+}
+
+struct NullBlkModule {
+    _disk: Pin<Box<Mutex<GenDisk<NullBlkDevice>>>>,
+}
+
+fn add_disk(tagset: Arc<TagSet<NullBlkDevice>>) -> Result<GenDisk<NullBlkDevice>> {
+    let tree = RadixTree::new()?;
+    let queue_data = Box::pin_init(new_spinlock!(tree, "rnullb:mem"))?;
+
+    let disk = GenDisk::try_new(tagset, queue_data)?;
+    disk.set_name(format_args!("rnullb{}", 0))?;
+    disk.set_capacity(*capacity_mib.read() << 11);
+    disk.set_queue_logical_block_size(4096);
+    disk.set_queue_physical_block_size(4096);
+    disk.set_rotational(false);
+    Ok(disk)
+}
+
+impl kernel::Module for NullBlkModule {
+    fn init(_module: &'static ThisModule) -> Result<Self> {
+        pr_info!("Rust null_blk loaded\n");
+        // Major device number?
+        let tagset = TagSet::try_new(1, (), 256, 1)?;
+        let disk = Box::pin_init(new_mutex!(add_disk(tagset)?, "nullb:disk"))?;
+
+        disk.lock().add()?;
+
+        Ok(Self { _disk: disk })
+    }
+}
+
+impl Drop for NullBlkModule {
+    fn drop(&mut self) {
+        pr_info!("Dropping rnullb\n");
+    }
+}
+
+struct NullBlkDevice;
+type Tree = kernel::radix_tree::RadixTree<Box<Pages<0>>>;
+type Data = Pin<Box<SpinLock<Tree>>>;
+
+impl NullBlkDevice {
+    #[inline(always)]
+    fn write(tree: &mut Tree, sector: usize, segment: &Segment<'_>) -> Result {
+        let idx = sector >> 3; // TODO: PAGE_SECTOR_SHIFT
+        let mut page = if let Some(page) = tree.get_mut(idx as u64) {
+            page
+        } else {
+            tree.try_insert(idx as u64, Box::try_new(Pages::new()?)?)?;
+            tree.get_mut(idx as u64).unwrap()
+        };
+
+        segment.copy_to_page_atomic(&mut page)?;
+
+        Ok(())
+    }
+
+    #[inline(always)]
+    fn read(tree: &mut Tree, sector: usize, segment: &mut Segment<'_>) -> Result {
+        let idx = sector >> 3; // TODO: PAGE_SECTOR_SHIFT
+        if let Some(page) = tree.get(idx as u64) {
+            segment.copy_from_page_atomic(page)?;
+        }
+
+        Ok(())
+    }
+
+    #[inline(never)]
+    fn transfer(
+        command: bindings::req_op,
+        tree: &mut Tree,
+        sector: usize,
+        segment: &mut Segment<'_>,
+    ) -> Result {
+        match command {
+            bindings::req_op_REQ_OP_WRITE => Self::write(tree, sector, segment)?,
+            bindings::req_op_REQ_OP_READ => Self::read(tree, sector, segment)?,
+            _ => (),
+        }
+        Ok(())
+    }
+}
+
+#[vtable]
+impl Operations for NullBlkDevice {
+    type RequestData = ();
+    type QueueData = Data;
+    type HwData = ();
+    type TagSetData = ();
+
+    fn new_request_data(
+        _tagset_data: <Self::TagSetData as ForeignOwnable>::Borrowed<'_>,
+    ) -> Result<Self::RequestData> {
+        Ok(())
+    }
+
+    #[inline(always)]
+    fn queue_rq(
+        _hw_data: <Self::HwData as ForeignOwnable>::Borrowed<'_>,
+        queue_data: <Self::QueueData as ForeignOwnable>::Borrowed<'_>,
+        rq: &mq::Request<Self>,
+        _is_last: bool,
+    ) -> Result {
+        rq.start();
+        if *memory_backed.read() {
+            let mut tree = queue_data.lock_irqsave();
+
+            let mut sector = rq.sector();
+            for bio in rq.bio_iter() {
+                for mut segment in bio.segment_iter() {
+                    let _ = Self::transfer(rq.command(), &mut tree, sector, &mut segment);
+                    sector += segment.len() >> 9; // TODO: SECTOR_SHIFT
+                }
+            }
+        }
+        rq.end_ok();
+        Ok(())
+    }
+
+    fn commit_rqs(
+        _hw_data: <Self::HwData as ForeignOwnable>::Borrowed<'_>,
+        _queue_data: <Self::QueueData as ForeignOwnable>::Borrowed<'_>,
+    ) {
+    }
+
+    fn complete(_rq: &mq::Request<Self>) {
+        //rq.end_ok();
+    }
+
+    fn init_hctx(
+        _tagset_data: <Self::TagSetData as ForeignOwnable>::Borrowed<'_>,
+        _hctx_idx: u32,
+    ) -> Result<Self::HwData> {
+        Ok(())
+    }
+}
diff --git a/scripts/Makefile.build b/scripts/Makefile.build
index 9f94fc83f086..94127fc3cf77 100644
--- a/scripts/Makefile.build
+++ b/scripts/Makefile.build
@@ -277,7 +277,7 @@ $(obj)/%.lst: $(src)/%.c FORCE
 # Compile Rust sources (.rs)
 # ---------------------------------------------------------------------------
 
-rust_allowed_features := core_ffi_c,explicit_generic_args_with_impl_trait,new_uninit,pin_macro
+rust_allowed_features := core_ffi_c,explicit_generic_args_with_impl_trait,new_uninit,pin_macro,allocator_api
 
 rust_common_cmd = \
 	RUST_MODFILE=$(modfile) $(RUSTC_OR_CLIPPY) $(rust_flags) \
-- 
2.40.0


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [RFC PATCH 11/11] rust: inline a number of short functions
  2023-05-03  9:06 [RFC PATCH 00/11] Rust null block driver Andreas Hindborg
                   ` (9 preceding siblings ...)
  2023-05-03  9:07 ` [RFC PATCH 10/11] rust: add null block driver Andreas Hindborg
@ 2023-05-03  9:07 ` Andreas Hindborg
  2023-05-03 11:32 ` [RFC PATCH 00/11] Rust null block driver Niklas Cassel
                   ` (5 subsequent siblings)
  16 siblings, 0 replies; 69+ messages in thread
From: Andreas Hindborg @ 2023-05-03  9:07 UTC (permalink / raw)
  To: Jens Axboe, Christoph Hellwig, Keith Busch, Damien Le Moal,
	Hannes Reinecke, lsf-pc, rust-for-linux, linux-block
  Cc: Andreas Hindborg, Matthew Wilcox, Miguel Ojeda, Alex Gaynor,
	Wedson Almeida Filho, Boqun Feng, Gary Guo, Björn Roy Baron,
	Benno Lossin, Andreas Hindborg, open list, gost.dev

From: Andreas Hindborg <a.hindborg@samsung.com>

The rust compiler will not inline functions that live in vmlinux when
building modules. Add inline directives to these short functions to ensure that
they are inlined when building modules.

Signed-off-by: Andreas Hindborg <a.hindborg@samsung.com>
---
 rust/kernel/sync/lock.rs          | 2 ++
 rust/kernel/sync/lock/mutex.rs    | 2 ++
 rust/kernel/sync/lock/spinlock.rs | 2 ++
 rust/kernel/types.rs              | 6 ++++++
 4 files changed, 12 insertions(+)

diff --git a/rust/kernel/sync/lock.rs b/rust/kernel/sync/lock.rs
index bb21af8a8377..4bfc2f5d9841 100644
--- a/rust/kernel/sync/lock.rs
+++ b/rust/kernel/sync/lock.rs
@@ -101,6 +101,7 @@ impl<T: ?Sized, B: IrqSaveBackend> Lock<T, B> {
     /// Before acquiring the lock, it disables interrupts. When the guard is dropped, the interrupt
     /// state (either enabled or disabled) is restored to its state before
     /// [`lock_irqsave`](Self::lock_irqsave) was called.
+    #[inline(always)]
     pub fn lock_irqsave(&self) -> Guard<'_, T, B> {
         // SAFETY: The constructor of the type calls `init`, so the existence of the object proves
         // that `init` was called.
@@ -210,6 +211,7 @@ impl<T: ?Sized, B: Backend> core::ops::DerefMut for Guard<'_, T, B> {
 }
 
 impl<T: ?Sized, B: Backend> Drop for Guard<'_, T, B> {
+    #[inline(always)]
     fn drop(&mut self) {
         // SAFETY: The caller owns the lock, so it is safe to unlock it.
         unsafe { B::unlock(self.lock.state.get(), &self.state) };
diff --git a/rust/kernel/sync/lock/mutex.rs b/rust/kernel/sync/lock/mutex.rs
index 923472f04af4..5e8096811b98 100644
--- a/rust/kernel/sync/lock/mutex.rs
+++ b/rust/kernel/sync/lock/mutex.rs
@@ -104,12 +104,14 @@ unsafe impl super::Backend for MutexBackend {
         unsafe { bindings::__mutex_init(ptr, name, key) }
     }
 
+    #[inline(always)]
     unsafe fn lock(ptr: *mut Self::State) -> Self::GuardState {
         // SAFETY: The safety requirements of this function ensure that `ptr` points to valid
         // memory, and that it has been initialised before.
         unsafe { bindings::mutex_lock(ptr) };
     }
 
+    #[inline(always)]
     unsafe fn unlock(ptr: *mut Self::State, _guard_state: &Self::GuardState) {
         // SAFETY: The safety requirements of this function ensure that `ptr` is valid and that the
         // caller is the owner of the mutex.
diff --git a/rust/kernel/sync/lock/spinlock.rs b/rust/kernel/sync/lock/spinlock.rs
index 50b8775bb49d..23a973dab85c 100644
--- a/rust/kernel/sync/lock/spinlock.rs
+++ b/rust/kernel/sync/lock/spinlock.rs
@@ -122,6 +122,7 @@ unsafe impl super::Backend for SpinLockBackend {
         None
     }
 
+    #[inline(always)]
     unsafe fn unlock(ptr: *mut Self::State, guard_state: &Self::GuardState) {
         match guard_state {
             // SAFETY: The safety requirements of this function ensure that `ptr` is valid and that
@@ -141,6 +142,7 @@ unsafe impl super::Backend for SpinLockBackend {
 // interrupt state, and the `irqrestore` variant of the lock release functions to restore the state
 // in `unlock` -- we use the guard context to determine which method was used to acquire the lock.
 unsafe impl super::IrqSaveBackend for SpinLockBackend {
+    #[inline(always)]
     unsafe fn lock_irqsave(ptr: *mut Self::State) -> Self::GuardState {
         // SAFETY: The safety requirements of this function ensure that `ptr` points to valid
         // memory, and that it has been initialised before.
diff --git a/rust/kernel/types.rs b/rust/kernel/types.rs
index 98e71e96a7fc..7be1f64bbde9 100644
--- a/rust/kernel/types.rs
+++ b/rust/kernel/types.rs
@@ -70,10 +70,12 @@ pub trait ForeignOwnable: Sized {
 impl<T: 'static> ForeignOwnable for Box<T> {
     type Borrowed<'a> = &'a T;
 
+    #[inline(always)]
     fn into_foreign(self) -> *const core::ffi::c_void {
         Box::into_raw(self) as _
     }
 
+    #[inline(always)]
     unsafe fn borrow<'a>(ptr: *const core::ffi::c_void) -> &'a T {
         // SAFETY: The safety requirements for this function ensure that the object is still alive,
         // so it is safe to dereference the raw pointer.
@@ -82,6 +84,7 @@ impl<T: 'static> ForeignOwnable for Box<T> {
         unsafe { &*ptr.cast() }
     }
 
+    #[inline(always)]
     unsafe fn from_foreign(ptr: *const core::ffi::c_void) -> Self {
         // SAFETY: The safety requirements of this function ensure that `ptr` comes from a previous
         // call to `Self::into_foreign`.
@@ -92,12 +95,15 @@ impl<T: 'static> ForeignOwnable for Box<T> {
 impl ForeignOwnable for () {
     type Borrowed<'a> = ();
 
+    #[inline(always)]
     fn into_foreign(self) -> *const core::ffi::c_void {
         core::ptr::NonNull::dangling().as_ptr()
     }
 
+    #[inline(always)]
     unsafe fn borrow<'a>(_: *const core::ffi::c_void) -> Self::Borrowed<'a> {}
 
+    #[inline(always)]
     unsafe fn from_foreign(_: *const core::ffi::c_void) -> Self {}
 }
 
-- 
2.40.0


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* Re: [RFC PATCH 01/11] rust: add radix tree abstraction
  2023-05-03  9:06 ` [RFC PATCH 01/11] rust: add radix tree abstraction Andreas Hindborg
@ 2023-05-03 10:34   ` Benno Lossin
  2023-05-05  4:04   ` Matthew Wilcox
  1 sibling, 0 replies; 69+ messages in thread
From: Benno Lossin @ 2023-05-03 10:34 UTC (permalink / raw)
  To: Andreas Hindborg
  Cc: Jens Axboe, Christoph Hellwig, Keith Busch, Damien Le Moal,
	Hannes Reinecke, lsf-pc, rust-for-linux, linux-block,
	Andreas Hindborg, Matthew Wilcox, Miguel Ojeda, Alex Gaynor,
	Wedson Almeida Filho, Boqun Feng, Gary Guo, Björn Roy Baron,
	linux-kernel, gost.dev

> From: Andreas Hindborg <a.hindborg@samsung.com>
> 
> Add abstractions for the C radix_tree. This abstraction allows Rust code to use
> the radix_tree as a map from `u64` keys to `ForeignOwnable` values.
> 
> Signed-off-by: Andreas Hindborg <a.hindborg@samsung.com>
> ---
>  rust/bindings/bindings_helper.h |   2 +
>  rust/bindings/lib.rs            |   1 +
>  rust/helpers.c                  |  22 +++++
>  rust/kernel/lib.rs              |   1 +
>  rust/kernel/radix_tree.rs       | 156 ++++++++++++++++++++++++++++++++
>  5 files changed, 182 insertions(+)
>  create mode 100644 rust/kernel/radix_tree.rs
> 
> diff --git a/rust/bindings/bindings_helper.h b/rust/bindings/bindings_helper.h
> index 50e7a76d5455..52834962b94d 100644
> --- a/rust/bindings/bindings_helper.h
> +++ b/rust/bindings/bindings_helper.h
> @@ -10,7 +10,9 @@
>  #include <linux/refcount.h>
>  #include <linux/wait.h>
>  #include <linux/sched.h>
> +#include <linux/radix-tree.h>

Can you sort these alphabetically? Also the constants down below.

> 
>  /* `bindgen` gets confused at certain things. */
>  const gfp_t BINDINGS_GFP_KERNEL = GFP_KERNEL;
> +const gfp_t BINDINGS_GFP_ATOMIC = GFP_ATOMIC;
>  const gfp_t BINDINGS___GFP_ZERO = __GFP_ZERO;
> diff --git a/rust/bindings/lib.rs b/rust/bindings/lib.rs
> index 7b246454e009..62f36a9eb1f4 100644
> --- a/rust/bindings/lib.rs
> +++ b/rust/bindings/lib.rs
> @@ -51,4 +51,5 @@ mod bindings_helper {
>  pub use bindings_raw::*;
> 
>  pub const GFP_KERNEL: gfp_t = BINDINGS_GFP_KERNEL;
> +pub const GFP_ATOMIC: gfp_t = BINDINGS_GFP_ATOMIC;
>  pub const __GFP_ZERO: gfp_t = BINDINGS___GFP_ZERO;
> diff --git a/rust/helpers.c b/rust/helpers.c
> index 81e80261d597..5dd5e325b7cc 100644
> --- a/rust/helpers.c
> +++ b/rust/helpers.c
> @@ -26,6 +26,7 @@
>  #include <linux/spinlock.h>
>  #include <linux/sched/signal.h>
>  #include <linux/wait.h>
> +#include <linux/radix-tree.h>
> 
>  __noreturn void rust_helper_BUG(void)
>  {
> @@ -128,6 +129,27 @@ void rust_helper_put_task_struct(struct task_struct *t)
>  }
>  EXPORT_SYMBOL_GPL(rust_helper_put_task_struct);
> 
> +void rust_helper_init_radix_tree(struct xarray *tree, gfp_t gfp_mask)
> +{
> +	INIT_RADIX_TREE(tree, gfp_mask);
> +}
> +EXPORT_SYMBOL_GPL(rust_helper_init_radix_tree);
> +
> +void **rust_helper_radix_tree_iter_init(struct radix_tree_iter *iter,
> +					unsigned long start)
> +{
> +	return radix_tree_iter_init(iter, start);
> +}
> +EXPORT_SYMBOL_GPL(rust_helper_radix_tree_iter_init);
> +
> +void **rust_helper_radix_tree_next_slot(void **slot,
> +					struct radix_tree_iter *iter,
> +					unsigned flags)
> +{
> +	return radix_tree_next_slot(slot, iter, flags);
> +}
> +EXPORT_SYMBOL_GPL(rust_helper_radix_tree_next_slot);
> +
>  /*
>   * We use `bindgen`'s `--size_t-is-usize` option to bind the C `size_t` type
>   * as the Rust `usize` type, so we can use it in contexts where Rust
> diff --git a/rust/kernel/lib.rs b/rust/kernel/lib.rs
> index 676995d4e460..a85cb6aae8d6 100644
> --- a/rust/kernel/lib.rs
> +++ b/rust/kernel/lib.rs
> @@ -40,6 +40,7 @@ pub mod init;
>  pub mod ioctl;
>  pub mod prelude;
>  pub mod print;
> +pub mod radix_tree;
>  mod static_assert;
>  #[doc(hidden)]
>  pub mod std_vendor;
> diff --git a/rust/kernel/radix_tree.rs b/rust/kernel/radix_tree.rs
> new file mode 100644
> index 000000000000..f659ab8b017c
> --- /dev/null
> +++ b/rust/kernel/radix_tree.rs
> @@ -0,0 +1,156 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +//! RadixTree abstraction.
> +//!
> +//! C header: [`include/linux/radix_tree.h`](../../include/linux/radix_tree.h)
> +
> +use crate::error::to_result;
> +use crate::error::Result;
> +use crate::types::ForeignOwnable;
> +use crate::types::Opaque;
> +use crate::types::ScopeGuard;
> +use alloc::boxed::Box;
> +use core::marker::PhantomData;
> +use core::pin::Pin;

I would prefer `use crate::{error::{...}, types::{...}};`.

> +
> +type Key = u64;

Is there a reason to not make this `pub`?

> +
> +/// A map of `u64` to `ForeignOwnable`
> +///
> +/// # Invariants
> +///
> +/// - `tree` always points to a valid and initialized `struct radix_tree`.
> +/// - Pointers stored in the tree are created by a call to `ForignOwnable::into_foreign()`
> +pub struct RadixTree<V: ForeignOwnable> {
> +    tree: Pin<Box<Opaque<bindings::xarray>>>,
> +    _marker: PhantomData<V>,
> +}

Design question: why is the tree boxed? Shouldn't the user decide how
they want to manage the memory of the tree? In other words, should
`tree` just be `Opaque<bindings::xarray>` and this type initialized via
`pin-init`?

> +
> +impl<V: ForeignOwnable> RadixTree<V> {
> +    /// Create a new radix tree
> +    ///
> +    /// Note: This function allocates memory with `GFP_ATOMIC`.

There probably is a reason why this uses `GFP_ATOMIC`, but I think if we
decide that the tree allocates memory there should be also a function
with `GFP_KERNEL`. That function should be called `new()` and this one
`new_atomic()` or `new_gfp(gfp: gfp_t)`. Also note that I think the
return type should be `Result<Self, AllocError>`, we should be explicit
where we can be.

> +    pub fn new() -> Result<Self> {
> +        let tree = Pin::from(Box::try_new(Opaque::uninit())?);
> +
> +        // SAFETY: `tree` points to allocated but not initialized memory. This
> +        // call will initialize the memory.
> +        unsafe { bindings::init_radix_tree(tree.get(), bindings::GFP_ATOMIC) };
> +
> +        Ok(Self {
> +            tree,
> +            _marker: PhantomData,
> +        })
> +    }
> +
> +    /// Try to insert a value into the tree
> +    pub fn try_insert(&mut self, key: Key, value: V) -> Result<()> {
> +        // SAFETY: `self.tree` points to a valid and initialized `struct radix_tree`

Also add that the invariant of only inserting `ForignOwnable` pointers
is upheld.

> +        let ret =
> +            unsafe { bindings::radix_tree_insert(self.tree.get(), key, value.into_foreign() as _) };

Instead of `as _` prefer to use `.cast::<$type>()` or in this case probably `.cast_mut()`.

> +        to_result(ret)
> +    }
> +
> +    /// Search for `key` in the map. Returns a reference to the associated
> +    /// value if found.
> +    pub fn get(&self, key: Key) -> Option<V::Borrowed<'_>> {
> +        // SAFETY: `self.tree` points to a valid and initialized `struct radix_tree`
> +        let item =
> +            core::ptr::NonNull::new(unsafe { bindings::radix_tree_lookup(self.tree.get(), key) })?;

You use `NonNull` quiet a bit, consider importing it.

> +
> +        // SAFETY: `item` was created by a call to
> +        // `ForeignOwnable::into_foreign()`. As `get_mut()` and `remove()` takes
> +        // a `&mut self`, no mutable borrows for `item` can exist and
> +        // `ForeignOwnable::from_foreign()` cannot be called until this borrow
> +        // is dropped.
> +        Some(unsafe { V::borrow(item.as_ptr()) })
> +    }
> +
> +    /// Search for `key` in the map. Return a mutable reference to the
> +    /// associated value if found.
> +    pub fn get_mut(&mut self, key: Key) -> Option<MutBorrow<'_, V>> {
> +        let item =
> +            core::ptr::NonNull::new(unsafe { bindings::radix_tree_lookup(self.tree.get(), key) })?;
> +
> +        // SAFETY: `item` was created by a call to
> +        // `ForeignOwnable::into_foreign()`. As `get()` takes a `&self` and
> +        // `remove()` takes a `&mut self`, no borrows for `item` can exist and
> +        // `ForeignOwnable::from_foreign()` cannot be called until this borrow
> +        // is dropped.
> +        Some(MutBorrow {
> +            guard: unsafe { V::borrow_mut(item.as_ptr()) },
> +            _marker: core::marker::PhantomData,
> +        })
> +    }
> +
> +    /// Search for `key` in the map. If `key` is found, the key and value is
> +    /// removed from the map and the value is returned.
> +    pub fn remove(&mut self, key: Key) -> Option<V> {
> +        // SAFETY: `self.tree` points to a valid and initialized `struct radix_tree`
> +        let item =
> +            core::ptr::NonNull::new(unsafe { bindings::radix_tree_delete(self.tree.get(), key) })?;
> +
> +        // SAFETY: `item` was created by a call to
> +        // `ForeignOwnable::into_foreign()` and no borrows to `item` can exist
> +        // because this function takes a `&mut self`.
> +        Some(unsafe { ForeignOwnable::from_foreign(item.as_ptr()) })
> +    }
> +}
> +
> +impl<V: ForeignOwnable> Drop for RadixTree<V> {
> +    fn drop(&mut self) {
> +        let mut iter = bindings::radix_tree_iter {
> +            index: 0,
> +            next_index: 0,
> +            tags: 0,
> +            node: core::ptr::null_mut(),
> +        };
> +
> +        // SAFETY: Iter is valid as we allocated it on the stack above
> +        let mut slot = unsafe { bindings::radix_tree_iter_init(&mut iter, 0) };
> +        loop {
> +            if slot.is_null() {
> +                // SAFETY: Both `self.tree` and `iter` are valid
> +                slot = unsafe { bindings::radix_tree_next_chunk(self.tree.get(), &mut iter, 0) };
> +            }
> +
> +            if slot.is_null() {
> +                break;
> +            }
> +
> +            // SAFETY: `self.tree` is valid and iter is managed by
> +            // `radix_tree_next_chunk()` and `radix_tree_next_slot()`
> +            let item = unsafe { bindings::radix_tree_delete(self.tree.get(), iter.index) };
> +            assert!(!item.is_null());
> +
> +            // SAFETY: All items in the tree are created by a call to
> +            // `ForeignOwnable::into_foreign()`.
> +            let _ = unsafe { V::from_foreign(item) };
> +
> +            // SAFETY: `self.tree` is valid and iter is managed by
> +            // `radix_tree_next_chunk()` and `radix_tree_next_slot()`. Slot is
> +            // not null.
> +            slot = unsafe { bindings::radix_tree_next_slot(slot, &mut iter, 0) };
> +        }
> +    }
> +}
> +
> +/// A mutable borrow of an object owned by a `RadixTree`
> +pub struct MutBorrow<'a, V: ForeignOwnable> {
> +    guard: ScopeGuard<V, fn(V)>,
> +    _marker: core::marker::PhantomData<&'a mut V>,
> +}
> +
> +impl<'a, V: ForeignOwnable> core::ops::Deref for MutBorrow<'a, V> {
> +    type Target = ScopeGuard<V, fn(V)>;

Why is `Target = ScopeGuard`? I think it makes more sense to use
`Target = V`.

> +
> +    fn deref(&self) -> &Self::Target {
> +        &self.guard
> +    }
> +}
> +
> +impl<'a, V: ForeignOwnable> core::ops::DerefMut for MutBorrow<'a, V> {
> +    fn deref_mut(&mut self) -> &mut Self::Target {
> +        &mut self.guard
> +    }
> +}
> --
> 2.40.0
> 

--
Cheers,
Benno

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC PATCH 00/11] Rust null block driver
  2023-05-03  9:06 [RFC PATCH 00/11] Rust null block driver Andreas Hindborg
                   ` (10 preceding siblings ...)
  2023-05-03  9:07 ` [RFC PATCH 11/11] rust: inline a number of short functions Andreas Hindborg
@ 2023-05-03 11:32 ` Niklas Cassel
  2023-05-03 12:29   ` Andreas Hindborg
  2023-05-03 16:47 ` Bart Van Assche
                   ` (4 subsequent siblings)
  16 siblings, 1 reply; 69+ messages in thread
From: Niklas Cassel @ 2023-05-03 11:32 UTC (permalink / raw)
  To: Andreas Hindborg
  Cc: Jens Axboe, Christoph Hellwig, Keith Busch, Damien Le Moal,
	Hannes Reinecke, lsf-pc, rust-for-linux, linux-block,
	Andreas Hindborg, Matthew Wilcox, Miguel Ojeda, Alex Gaynor,
	Wedson Almeida Filho, Boqun Feng, Gary Guo, Björn Roy Baron,
	Benno Lossin, open list, gost.dev

On Wed, May 03, 2023 at 11:06:57AM +0200, Andreas Hindborg wrote:
> From: Andreas Hindborg <a.hindborg@samsung.com>
> 

(cut)

> 
> For each measurement the drivers are loaded, a drive is configured with memory
> backing and a size of 4 GiB. C null_blk is configured to match the implemented
> modes of the Rust driver: `blocksize` is set to 4 KiB, `completion_nsec` to 0,
> `irqmode` to 0 (IRQ_NONE), `queue_mode` to 2 (MQ), `hw_queue_depth` to 256 and
> `memory_backed` to 1. For both the drivers, the queue scheduler is set to
> `none`. These measurements are made using 30 second runs of `fio` with the
> `PSYNC` IO engine with workers pinned to separate CPU cores. The measurements
> are done inside a virtual machine (qemu/kvm) on an Intel Alder Lake workstation
> (i5-12600).

Hello Andreas,

I'm curious why you used psync ioengine for the benchmarks.

As psync is a sync ioengine, it means queue depth == 1.

Wouldn't it have been more interesting to see an async ioengine,
together with different queue depths?


You might want to explain your table a bit more.
It might be nice to see IOPS and average latencies.

As an example of a table that I find easier to interpret,
see e.g. the table on page 29 in the SPDK performance report:
https://ci.spdk.io/download/performance-reports/SPDK_nvme_bdev_perf_report_2301.pdf


Kind regards,
Niklas

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC PATCH 06/11] rust: apply cache line padding for `SpinLock`
  2023-05-03  9:07 ` [RFC PATCH 06/11] rust: apply cache line padding for `SpinLock` Andreas Hindborg
@ 2023-05-03 12:03   ` Alice Ryhl
  2024-02-23 11:29     ` Andreas Hindborg (Samsung)
  0 siblings, 1 reply; 69+ messages in thread
From: Alice Ryhl @ 2023-05-03 12:03 UTC (permalink / raw)
  To: nmi
  Cc: Damien.LeMoal, a.hindborg, alex.gaynor, axboe, benno.lossin,
	bjorn3_gh, boqun.feng, gary, gost.dev, hare, hch, kbusch,
	linux-block, linux-kernel, lsf-pc, ojeda, rust-for-linux,
	wedsonaf, willy

On Wed, 3 May 2023 11:07:03 +0200, Andreas Hindborg <a.hindborg@samsung.com> wrote:
> The kernel `struct spinlock` is 4 bytes on x86 when lockdep is not enabled. The
> structure is not padded to fit a cache line. The effect of this for `SpinLock`
> is that the lock variable and the value protected by the lock will share a cache
> line, depending on the alignment requirements of the protected value. Aligning
> the lock variable and the protected value to a cache line yields a 20%
> performance increase for the Rust null block driver for sequential reads to
> memory backed devices at 6 concurrent readers.
> 
> Signed-off-by: Andreas Hindborg <a.hindborg@samsung.com>

This applies the cacheline padding to all spinlocks unconditionally.
It's not clear to me that we want to do that. Instead, I suggest using
`SpinLock<CachePadded<T>>` in the null block driver to opt-in to the
cache padding there, and let other drivers choose whether or not they
want to cache pad their locks.

On Wed, 3 May 2023 11:07:03 +0200, Andreas Hindborg <a.hindborg@samsung.com> wrote:
> diff --git a/rust/kernel/cache_padded.rs b/rust/kernel/cache_padded.rs
> new file mode 100644
> index 000000000000..758678e71f50
> --- /dev/null
> +++ b/rust/kernel/cache_padded.rs
> 
> +impl<T> CachePadded<T> {
> +    /// Pads and aligns a value to 64 bytes.
> +    #[inline(always)]
> +    pub(crate) const fn new(t: T) -> CachePadded<T> {
> +        CachePadded::<T> { value: t }
> +    }
> +}

Please make this `pub` instead of just `pub(crate)`. Other drivers might
want to use this directly.

On Wed, 3 May 2023 11:07:03 +0200, Andreas Hindborg <a.hindborg@samsung.com> wrote:
> diff --git a/rust/kernel/sync/lock/spinlock.rs b/rust/kernel/sync/lock/spinlock.rs
> index 979b56464a4e..e39142a8148c 100644
> --- a/rust/kernel/sync/lock/spinlock.rs
> +++ b/rust/kernel/sync/lock/spinlock.rs
> @@ -100,18 +103,20 @@ unsafe impl super::Backend for SpinLockBackend {
>      ) {
>          // SAFETY: The safety requirements ensure that `ptr` is valid for writes, and `name` and
>          // `key` are valid for read indefinitely.
> -        unsafe { bindings::__spin_lock_init(ptr, name, key) }
> +        unsafe { bindings::__spin_lock_init((&mut *ptr).deref_mut(), name, key) }
>      }
>  
> +    #[inline(always)]
>      unsafe fn lock(ptr: *mut Self::State) -> Self::GuardState {
>          // SAFETY: The safety requirements of this function ensure that `ptr` points to valid
>          // memory, and that it has been initialised before.
> -        unsafe { bindings::spin_lock(ptr) }
> +        unsafe { bindings::spin_lock((&mut *ptr).deref_mut()) }
>      }
>  
> +    #[inline(always)]
>      unsafe fn unlock(ptr: *mut Self::State, _guard_state: &Self::GuardState) {
>          // SAFETY: The safety requirements of this function ensure that `ptr` is valid and that the
>          // caller is the owner of the mutex.
> -        unsafe { bindings::spin_unlock(ptr) }
> +        unsafe { bindings::spin_unlock((&mut *ptr).deref_mut()) }
>      }
>  }

I would prefer to remain in pointer-land for the above operations. I
think that this leads to core that is more obviously correct.

For example:

```
impl<T> CachePadded<T> {
    pub const fn raw_get(ptr: *mut Self) -> *mut T {
        core::ptr::addr_of_mut!((*ptr).value)
    }
}

#[inline(always)]
unsafe fn unlock(ptr: *mut Self::State, _guard_state: &Self::GuardState) {
    unsafe { bindings::spin_unlock(CachePadded::raw_get(ptr)) }
}
```

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC PATCH 00/11] Rust null block driver
  2023-05-03 11:32 ` [RFC PATCH 00/11] Rust null block driver Niklas Cassel
@ 2023-05-03 12:29   ` Andreas Hindborg
  2023-05-03 13:54     ` Niklas Cassel
  0 siblings, 1 reply; 69+ messages in thread
From: Andreas Hindborg @ 2023-05-03 12:29 UTC (permalink / raw)
  To: Niklas Cassel
  Cc: Jens Axboe, Christoph Hellwig, Keith Busch, Damien Le Moal,
	Hannes Reinecke, lsf-pc, rust-for-linux, linux-block,
	Matthew Wilcox, Miguel Ojeda, Alex Gaynor, Wedson Almeida Filho,
	Boqun Feng, Gary Guo, Björn Roy Baron, Benno Lossin,
	open list, gost.dev


Hi Niklas,

Niklas Cassel <Niklas.Cassel@wdc.com> writes:

> On Wed, May 03, 2023 at 11:06:57AM +0200, Andreas Hindborg wrote:
>> From: Andreas Hindborg <a.hindborg@samsung.com>
>> 
>
> (cut)
>
>> 
>> For each measurement the drivers are loaded, a drive is configured with memory
>> backing and a size of 4 GiB. C null_blk is configured to match the implemented
>> modes of the Rust driver: `blocksize` is set to 4 KiB, `completion_nsec` to 0,
>> `irqmode` to 0 (IRQ_NONE), `queue_mode` to 2 (MQ), `hw_queue_depth` to 256 and
>> `memory_backed` to 1. For both the drivers, the queue scheduler is set to
>> `none`. These measurements are made using 30 second runs of `fio` with the
>> `PSYNC` IO engine with workers pinned to separate CPU cores. The measurements
>> are done inside a virtual machine (qemu/kvm) on an Intel Alder Lake workstation
>> (i5-12600).
>
> Hello Andreas,
>
> I'm curious why you used psync ioengine for the benchmarks.
>
> As psync is a sync ioengine, it means queue depth == 1.
>
> Wouldn't it have been more interesting to see an async ioengine,
> together with different queue depths?

That would also be interesting. I was a bit constrained on CPU cycles so
I had to choose. I intend to produce the numbers you ask for. For now
here is two runs of random read using io_uring with queue depth 128
(same table style):


For iodepth_batch_submit=1, iodepth_batch_complete=1:
+---------+----------+---------------------+---------------------+
| jobs/bs | workload |          1          |          6          |
+---------+----------+---------------------+---------------------+
|    4k   | randread | 2.97 0.00 (0.9,0.0) | 4.06 0.00 (1.8,0.0) |
+---------+----------+---------------------+---------------------+

For iodepth_batch_submit=16, iodepth_batch_complete=16:
+---------+----------+---------------------+---------------------+
| jobs/bs | workload |          1          |          6          |
+---------+----------+---------------------+---------------------+
|    4k   | randread | 4.40 0.00 (1.1,0.0) | 4.87 0.00 (1.8,0.0) |
+---------+----------+---------------------+---------------------+

Above numbers are 60 second runs on bare metal, so not entirely
comparable with the ones in the cover letter.

> You might want to explain your table a bit more.

I understand that the table can be difficult to read. It is not easy to
convey all this information in ASCII email. The numbers in parenthesis
in the cells _are_ IOPS x 10e6 (read,write). Referring to he second
table above, for 1 job at 4k bs the Rust driver performed 4.8 percent
more IOPS than the C driver. The C driver did 1.1M IOPS. I hope this
clarifies the table, otherwise let me know!

> It might be nice to see IOPS and average latencies.

I did collect latency info as well, including completion latency
percentiles. It's just difficult to fit all that data in an email. I
have the fio json output, let me know if you want it and I will find a
way to get it to you. I am considering setting up some kind of CI that
will publish the performance results online automatically so that it
will be a link instead of an inline table.

>
> As an example of a table that I find easier to interpret,
> see e.g. the table on page 29 in the SPDK performance report:
> https://ci.spdk.io/download/performance-reports/SPDK_nvme_bdev_perf_report_2301.pdf

Thanks for the input, I will be sure to reference that next time. Just
for clarity, as you mentioned there is only one queue depth in play for
the numbers in the cover letter.

Best regards
Andreas

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC PATCH 02/11] rust: add `pages` module for handling page allocation
  2023-05-03  9:06 ` [RFC PATCH 02/11] rust: add `pages` module for handling page allocation Andreas Hindborg
@ 2023-05-03 12:31   ` Benno Lossin
  2023-05-03 12:38     ` Benno Lossin
  2023-05-05  4:09   ` Matthew Wilcox
  1 sibling, 1 reply; 69+ messages in thread
From: Benno Lossin @ 2023-05-03 12:31 UTC (permalink / raw)
  To: Andreas Hindborg
  Cc: Jens Axboe, Christoph Hellwig, Keith Busch, Damien Le Moal,
	Hannes Reinecke, lsf-pc, rust-for-linux, linux-block,
	Andreas Hindborg, Matthew Wilcox, Miguel Ojeda, Alex Gaynor,
	Wedson Almeida Filho, Boqun Feng, Gary Guo, Björn Roy Baron,
	linux-kernel, gost.dev

On 03.05.23 11:06, Andreas Hindborg wrote:
> From: Andreas Hindborg <a.hindborg@samsung.com>
> 
> This patch adds support for working with pages of order 0. Support for pages
> with higher order is deferred. Page allocation flags are fixed in this patch.
> Future work might allow the user to specify allocation flags.
> 
> This patch is a heavily modified version of code available in the rust tree [1],
> primarily adding support for multiple page mapping strategies.
> 
> [1] https://github.com/rust-for-Linux/linux/tree/bc22545f38d74473cfef3e9fd65432733435b79f/rust/kernel/pages.rs
> 
> Signed-off-by: Andreas Hindborg <a.hindborg@samsung.com>
> ---
>  rust/helpers.c       |  31 +++++
>  rust/kernel/lib.rs   |   6 +
>  rust/kernel/pages.rs | 284 +++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 321 insertions(+)
>  create mode 100644 rust/kernel/pages.rs
> 
> diff --git a/rust/helpers.c b/rust/helpers.c
> index 5dd5e325b7cc..9bd9d95da951 100644
> --- a/rust/helpers.c
> +++ b/rust/helpers.c
> @@ -27,6 +27,7 @@
>  #include <linux/sched/signal.h>
>  #include <linux/wait.h>
>  #include <linux/radix-tree.h>
> +#include <linux/highmem.h>
> 
>  __noreturn void rust_helper_BUG(void)
>  {
> @@ -150,6 +151,36 @@ void **rust_helper_radix_tree_next_slot(void **slot,
>  }
>  EXPORT_SYMBOL_GPL(rust_helper_radix_tree_next_slot);
> 
> +void *rust_helper_kmap(struct page *page)
> +{
> +	return kmap(page);
> +}
> +EXPORT_SYMBOL_GPL(rust_helper_kmap);
> +
> +void rust_helper_kunmap(struct page *page)
> +{
> +	return kunmap(page);
> +}
> +EXPORT_SYMBOL_GPL(rust_helper_kunmap);
> +
> +void *rust_helper_kmap_atomic(struct page *page)
> +{
> +	return kmap_atomic(page);
> +}
> +EXPORT_SYMBOL_GPL(rust_helper_kmap_atomic);
> +
> +void rust_helper_kunmap_atomic(void *address)
> +{
> +	kunmap_atomic(address);
> +}
> +EXPORT_SYMBOL_GPL(rust_helper_kunmap_atomic);
> +
> +struct page *rust_helper_alloc_pages(gfp_t gfp_mask, unsigned int order)
> +{
> +	return alloc_pages(gfp_mask, order);
> +}
> +EXPORT_SYMBOL_GPL(rust_helper_alloc_pages);
> +
>  /*
>   * We use `bindgen`'s `--size_t-is-usize` option to bind the C `size_t` type
>   * as the Rust `usize` type, so we can use it in contexts where Rust
> diff --git a/rust/kernel/lib.rs b/rust/kernel/lib.rs
> index a85cb6aae8d6..8bef6686504b 100644
> --- a/rust/kernel/lib.rs
> +++ b/rust/kernel/lib.rs
> @@ -38,6 +38,7 @@ mod build_assert;
>  pub mod error;
>  pub mod init;
>  pub mod ioctl;
> +pub mod pages;
>  pub mod prelude;
>  pub mod print;
>  pub mod radix_tree;
> @@ -57,6 +58,11 @@ pub use uapi;
>  #[doc(hidden)]
>  pub use build_error::build_error;
> 
> +/// Page size defined in terms of the `PAGE_SHIFT` macro from C.
> #@
> #@ `PAGE_SHIFT` is not using a doc-link.
> #@
> +///
> +/// [`PAGE_SHIFT`]: ../../../include/asm-generic/page.h
> +pub const PAGE_SIZE: u32 = 1 << bindings::PAGE_SHIFT;
> #@
> #@ This should be of type `usize`.
> #@
> +
>  /// Prefix to appear before log messages printed from within the `kernel` crate.
>  const __LOG_PREFIX: &[u8] = b"rust_kernel\0";
> 
> diff --git a/rust/kernel/pages.rs b/rust/kernel/pages.rs
> new file mode 100644
> index 000000000000..ed51b053dd5d
> --- /dev/null
> +++ b/rust/kernel/pages.rs
> @@ -0,0 +1,284 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +//! Kernel page allocation and management.
> +//!
> +//! This module currently provides limited support. It supports pages of order 0
> +//! for most operations. Page allocation flags are fixed.
> +
> +use crate::{bindings, error::code::*, error::Result, PAGE_SIZE};
> +use core::{marker::PhantomData, ptr};
> +
> +/// A set of physical pages.
> +///
> +/// `Pages` holds a reference to a set of pages of order `ORDER`. Having the order as a generic
> +/// const allows the struct to have the same size as a pointer.
> #@
> #@ I would remove the 'Having the order as a...' sentence. Since that is
> #@ just implementation detail.
> #@
> +///
> +/// # Invariants
> +///
> +/// The pointer `Pages::pages` is valid and points to 2^ORDER pages.
> #@
> #@ `Pages::pages` -> `pages`.
> #@
> +pub struct Pages<const ORDER: u32> {
> +    pub(crate) pages: *mut bindings::page,
> +}
> +
> +impl<const ORDER: u32> Pages<ORDER> {
> +    /// Allocates a new set of contiguous pages.
> +    pub fn new() -> Result<Self> {
> +        let pages = unsafe {
> +            bindings::alloc_pages(
> +                bindings::GFP_KERNEL | bindings::__GFP_ZERO | bindings::___GFP_HIGHMEM,
> +                ORDER,
> +            )
> +        };
> #@
> #@ Missing `SAFETY` comment.
> #@
> +        if pages.is_null() {
> +            return Err(ENOMEM);
> +        }
> +        // INVARIANTS: We checked that the allocation above succeeded.
> +        // SAFETY: We allocated pages above
> +        Ok(unsafe { Self::from_raw(pages) })
> +    }
> +
> +    /// Create a `Pages` from a raw `struct page` pointer
> +    ///
> +    /// # Safety
> +    ///
> +    /// Caller must own the pages pointed to by `ptr` as these will be freed
> +    /// when the returned `Pages` is dropped.
> +    pub unsafe fn from_raw(ptr: *mut bindings::page) -> Self {
> +        Self { pages: ptr }
> +    }
> +}
> +
> +impl Pages<0> {
> +    #[inline(always)]
> #@
> #@ Is this really needed? I think this function should be inlined
> #@ automatically.
> #@
> +    fn check_offset_and_map<I: MappingInfo>(
> +        &self,
> +        offset: usize,
> +        len: usize,
> +    ) -> Result<PageMapping<'_, I>>
> +    where
> +        Pages<0>: MappingActions<I>,
> #@
> #@ Why not use `Self: MappingActions<I>`?
> #@
> +    {
> +        let end = offset.checked_add(len).ok_or(EINVAL)?;
> +        if end as u32 > PAGE_SIZE {
> #@
> #@ Remove the `as u32`, since `PAGE_SIZE` should be of type `usize`.
> #@
> +            return Err(EINVAL);
> #@
> #@ I think it would make sense to create a more descriptive Rust error with
> #@ a `From` impl to turn it into an `Error`. It always is better to know from
> #@ the signature what exactly can go wrong when calling a function.
> #@
> +        }
> +
> +        let mapping = <Self as MappingActions<I>>::map(self);
> +
> +        Ok(mapping)
> #@
> #@ I would merge these lines.
> #@
> +    }
> +
> +    #[inline(always)]
> +    unsafe fn read_internal<I: MappingInfo>(
> #@
> #@ Missing `# Safety` section.
> #@
> +        &self,
> +        dest: *mut u8,
> +        offset: usize,
> +        len: usize,
> +    ) -> Result
> +    where
> +        Pages<0>: MappingActions<I>,
> +    {
> +        let mapping = self.check_offset_and_map::<I>(offset, len)?;
> +
> +        unsafe { ptr::copy_nonoverlapping((mapping.ptr as *mut u8).add(offset), dest, len) };
> #@
> #@ Missing `SAFETY` comment. Replace `as *mut u8` with `.cast::<u8>()`.
> #@
> +        Ok(())
> +    }
> +
> +    /// Maps the pages and reads from them into the given buffer.
> +    ///
> +    /// # Safety
> +    ///
> +    /// Callers must ensure that the destination buffer is valid for the given
> +    /// length. Additionally, if the raw buffer is intended to be recast, they
> +    /// must ensure that the data can be safely cast;
> +    /// [`crate::io_buffer::ReadableFromBytes`] has more details about it.
> +    /// `dest` may not point to the source page.
> #@
> #@ - `dest` is valid for writes for `len`.
> #@ - What is meant by 'the raw buffer is intended to be recast'?
> #@ - `io_buffer` does not yet exist in `rust-next`.
> #@
> +    #[inline(always)]
> +    pub unsafe fn read(&self, dest: *mut u8, offset: usize, len: usize) -> Result {
> +        unsafe { self.read_internal::<NormalMappingInfo>(dest, offset, len) }
> #@
> #@ Missing `SAFETY` comment.
> #@
> +    }
> +
> +    /// Maps the pages and reads from them into the given buffer. The page is
> +    /// mapped atomically.
> +    ///
> +    /// # Safety
> +    ///
> +    /// Callers must ensure that the destination buffer is valid for the given
> +    /// length. Additionally, if the raw buffer is intended to be recast, they
> +    /// must ensure that the data can be safely cast;
> +    /// [`crate::io_buffer::ReadableFromBytes`] has more details about it.
> +    /// `dest` may not point to the source page.
> +    #[inline(always)]
> +    pub unsafe fn read_atomic(&self, dest: *mut u8, offset: usize, len: usize) -> Result {
> +        unsafe { self.read_internal::<AtomicMappingInfo>(dest, offset, len) }
> #@
> #@ Missing `SAFETY` comment.
> #@
> +    }
> +
> +    #[inline(always)]
> +    unsafe fn write_internal<I: MappingInfo>(
> #@
> #@ Missing `# Safety` section.
> #@
> +        &self,
> +        src: *const u8,
> +        offset: usize,
> +        len: usize,
> +    ) -> Result
> +    where
> +        Pages<0>: MappingActions<I>,
> +    {
> +        let mapping = self.check_offset_and_map::<I>(offset, len)?;
> +
> +        unsafe { ptr::copy_nonoverlapping(src, (mapping.ptr as *mut u8).add(offset), len) };
> #@
> #@ Missing `SAFETY` comment.
> #@
> +        Ok(())
> +    }
> +
> +    /// Maps the pages and writes into them from the given buffer.
> +    ///
> +    /// # Safety
> +    ///
> +    /// Callers must ensure that the buffer is valid for the given length.
> +    /// Additionally, if the page is (or will be) mapped by userspace, they must
> +    /// ensure that no kernel data is leaked through padding if it was cast from
> +    /// another type; [`crate::io_buffer::WritableToBytes`] has more details
> +    /// about it. `src` must not point to the destination page.
> #@
> #@ `src` is valid for reads for `len`.
> #@
> +    #[inline(always)]
> +    pub unsafe fn write(&self, src: *const u8, offset: usize, len: usize) -> Result {
> +        unsafe { self.write_internal::<NormalMappingInfo>(src, offset, len) }
> +    }
> +
> +    /// Maps the pages and writes into them from the given buffer. The page is
> +    /// mapped atomically.
> +    ///
> +    /// # Safety
> +    ///
> +    /// Callers must ensure that the buffer is valid for the given length.
> +    /// Additionally, if the page is (or will be) mapped by userspace, they must
> +    /// ensure that no kernel data is leaked through padding if it was cast from
> +    /// another type; [`crate::io_buffer::WritableToBytes`] has more details
> +    /// about it. `src` must not point to the destination page.
> +    #[inline(always)]
> +    pub unsafe fn write_atomic(&self, src: *const u8, offset: usize, len: usize) -> Result {
> +        unsafe { self.write_internal::<AtomicMappingInfo>(src, offset, len) }
> +    }
> +
> +    /// Maps the page at index 0.
> +    #[inline(always)]
> +    pub fn kmap(&self) -> PageMapping<'_, NormalMappingInfo> {
> +        let ptr = unsafe { bindings::kmap(self.pages) };
> #@
> #@ Missing `SAFETY` comment.
> #@
> +
> +        PageMapping {
> +            page: self.pages,
> +            ptr,
> +            _phantom: PhantomData,
> +            _phantom2: PhantomData,
> +        }
> +    }
> +
> +    /// Atomically Maps the page at index 0.
> +    #[inline(always)]
> +    pub fn kmap_atomic(&self) -> PageMapping<'_, AtomicMappingInfo> {
> +        let ptr = unsafe { bindings::kmap_atomic(self.pages) };
> #@
> #@ Missing `SAFETY` comment.
> #@
> +
> +        PageMapping {
> +            page: self.pages,
> +            ptr,
> +            _phantom: PhantomData,
> +            _phantom2: PhantomData,
> +        }
> +    }
> +}
> +
> +impl<const ORDER: u32> Drop for Pages<ORDER> {
> +    fn drop(&mut self) {
> +        // SAFETY: By the type invariants, we know the pages are allocated with the given order.
> +        unsafe { bindings::__free_pages(self.pages, ORDER) };
> +    }
> +}
> +
> +/// Specifies the type of page mapping
> +pub trait MappingInfo {}
> +
> +/// Encapsulates methods to map and unmap pages
> +pub trait MappingActions<I: MappingInfo>
> +where
> +    Pages<0>: MappingActions<I>,
> +{
> +    /// Map a page into the kernel address scpace
> #@
> #@ Typo.
> #@
> +    fn map(pages: &Pages<0>) -> PageMapping<'_, I>;
> +
> +    /// Unmap a page specified by `mapping`
> +    ///
> +    /// # Safety
> +    ///
> +    /// Must only be called by `PageMapping::drop()`.
> +    unsafe fn unmap(mapping: &PageMapping<'_, I>);
> +}
> +
> +/// A type state indicating that pages were mapped atomically
> +pub struct AtomicMappingInfo;
> +impl MappingInfo for AtomicMappingInfo {}
> +
> +/// A type state indicating that pages were not mapped atomically
> +pub struct NormalMappingInfo;
> +impl MappingInfo for NormalMappingInfo {}
> +
> +impl MappingActions<AtomicMappingInfo> for Pages<0> {
> +    #[inline(always)]
> +    fn map(pages: &Pages<0>) -> PageMapping<'_, AtomicMappingInfo> {
> +        pages.kmap_atomic()
> +    }
> +
> +    #[inline(always)]
> +    unsafe fn unmap(mapping: &PageMapping<'_, AtomicMappingInfo>) {
> +        // SAFETY: An instance of `PageMapping` is created only when `kmap` succeeded for the given
> +        // page, so it is safe to unmap it here.
> +        unsafe { bindings::kunmap_atomic(mapping.ptr) };
> +    }
> +}
> +
> +impl MappingActions<NormalMappingInfo> for Pages<0> {
> +    #[inline(always)]
> +    fn map(pages: &Pages<0>) -> PageMapping<'_, NormalMappingInfo> {
> +        pages.kmap()
> +    }
> +
> +    #[inline(always)]
> +    unsafe fn unmap(mapping: &PageMapping<'_, NormalMappingInfo>) {
> +        // SAFETY: An instance of `PageMapping` is created only when `kmap` succeeded for the given
> +        // page, so it is safe to unmap it here.
> +        unsafe { bindings::kunmap(mapping.page) };
> +    }
> +}
> #@
> #@ I am not sure if this is the best implementation, why do the `kmap` and
> #@ `kmap_atomic` functions exist? Would it not make sense to implement
> #@ them entirely in `MappingActions::map`?
> #@
> +
> +/// An owned page mapping. When this struct is dropped, the page is unmapped.
> +pub struct PageMapping<'a, I: MappingInfo>
> +where
> +    Pages<0>: MappingActions<I>,
> +{
> +    page: *mut bindings::page,
> +    ptr: *mut core::ffi::c_void,
> +    _phantom: PhantomData<&'a i32>,
> +    _phantom2: PhantomData<I>,
> +}
> +
> +impl<'a, I: MappingInfo> PageMapping<'a, I>
> +where
> +    Pages<0>: MappingActions<I>,
> +{
> +    /// Return a pointer to the wrapped `struct page`
> +    #[inline(always)]
> +    pub fn get_ptr(&self) -> *mut core::ffi::c_void {
> +        self.ptr
> +    }
> +}
> +
> +// Because we do not have Drop specialization, we have to do this dance. Life
> +// would be much more simple if we could have `impl Drop for PageMapping<'_,
> +// Atomic>` and `impl Drop for PageMapping<'_, NotAtomic>`
> +impl<I: MappingInfo> Drop for PageMapping<'_, I>
> +where
> +    Pages<0>: MappingActions<I>,
> +{
> +    #[inline(always)]
> +    fn drop(&mut self) {
> +        // SAFETY: We are OK to call this because we are `PageMapping::drop()`
> +        unsafe { <Pages<0> as MappingActions<I>>::unmap(self) }
> +    }
> +}
> --
> 2.40.0

Here are some more general things:
- I think we could use this as an opportunity to add more docs about how
  paging works, or at least add some links to the C documentation.
- Can we improve the paging API? I have not given it any thought yet, but
  the current API looks very primitive.
- Documentation comments should form complete sentences (so end with `.`).

--
Cheers,
Benno


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC PATCH 02/11] rust: add `pages` module for handling page allocation
  2023-05-03 12:31   ` Benno Lossin
@ 2023-05-03 12:38     ` Benno Lossin
  0 siblings, 0 replies; 69+ messages in thread
From: Benno Lossin @ 2023-05-03 12:38 UTC (permalink / raw)
  To: Benno Lossin
  Cc: Andreas Hindborg, Jens Axboe, Christoph Hellwig, Keith Busch,
	Damien Le Moal, Hannes Reinecke, lsf-pc, rust-for-linux,
	linux-block, Andreas Hindborg, Matthew Wilcox, Miguel Ojeda,
	Alex Gaynor, Wedson Almeida Filho, Boqun Feng, Gary Guo,
	Björn Roy Baron, linux-kernel, gost.dev

Sorry, forgot to replace `> #@` with nothing. Fixed here:

On 03.05.23 11:06, Andreas Hindborg wrote:
> From: Andreas Hindborg <a.hindborg@samsung.com>
>
> This patch adds support for working with pages of order 0. Support for pages
> with higher order is deferred. Page allocation flags are fixed in this patch.
> Future work might allow the user to specify allocation flags.
>
> This patch is a heavily modified version of code available in the rust tree [1],
> primarily adding support for multiple page mapping strategies.
>
> [1] https://github.com/rust-for-Linux/linux/tree/bc22545f38d74473cfef3e9fd65432733435b79f/rust/kernel/pages.rs
>
> Signed-off-by: Andreas Hindborg <a.hindborg@samsung.com>
> ---
>  rust/helpers.c       |  31 +++++
>  rust/kernel/lib.rs   |   6 +
>  rust/kernel/pages.rs | 284 +++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 321 insertions(+)
>  create mode 100644 rust/kernel/pages.rs
>
> diff --git a/rust/helpers.c b/rust/helpers.c
> index 5dd5e325b7cc..9bd9d95da951 100644
> --- a/rust/helpers.c
> +++ b/rust/helpers.c
> @@ -27,6 +27,7 @@
>  #include <linux/sched/signal.h>
>  #include <linux/wait.h>
>  #include <linux/radix-tree.h>
> +#include <linux/highmem.h>
>
>  __noreturn void rust_helper_BUG(void)
>  {
> @@ -150,6 +151,36 @@ void **rust_helper_radix_tree_next_slot(void **slot,
>  }
>  EXPORT_SYMBOL_GPL(rust_helper_radix_tree_next_slot);
>
> +void *rust_helper_kmap(struct page *page)
> +{
> +	return kmap(page);
> +}
> +EXPORT_SYMBOL_GPL(rust_helper_kmap);
> +
> +void rust_helper_kunmap(struct page *page)
> +{
> +	return kunmap(page);
> +}
> +EXPORT_SYMBOL_GPL(rust_helper_kunmap);
> +
> +void *rust_helper_kmap_atomic(struct page *page)
> +{
> +	return kmap_atomic(page);
> +}
> +EXPORT_SYMBOL_GPL(rust_helper_kmap_atomic);
> +
> +void rust_helper_kunmap_atomic(void *address)
> +{
> +	kunmap_atomic(address);
> +}
> +EXPORT_SYMBOL_GPL(rust_helper_kunmap_atomic);
> +
> +struct page *rust_helper_alloc_pages(gfp_t gfp_mask, unsigned int order)
> +{
> +	return alloc_pages(gfp_mask, order);
> +}
> +EXPORT_SYMBOL_GPL(rust_helper_alloc_pages);
> +
>  /*
>   * We use `bindgen`'s `--size_t-is-usize` option to bind the C `size_t` type
>   * as the Rust `usize` type, so we can use it in contexts where Rust
> diff --git a/rust/kernel/lib.rs b/rust/kernel/lib.rs
> index a85cb6aae8d6..8bef6686504b 100644
> --- a/rust/kernel/lib.rs
> +++ b/rust/kernel/lib.rs
> @@ -38,6 +38,7 @@ mod build_assert;
>  pub mod error;
>  pub mod init;
>  pub mod ioctl;
> +pub mod pages;
>  pub mod prelude;
>  pub mod print;
>  pub mod radix_tree;
> @@ -57,6 +58,11 @@ pub use uapi;
>  #[doc(hidden)]
>  pub use build_error::build_error;
>
> +/// Page size defined in terms of the `PAGE_SHIFT` macro from C.

`PAGE_SHIFT` is not using a doc-link.

> +///
> +/// [`PAGE_SHIFT`]: ../../../include/asm-generic/page.h
> +pub const PAGE_SIZE: u32 = 1 << bindings::PAGE_SHIFT;

This should be of type `usize`.

> +
>  /// Prefix to appear before log messages printed from within the `kernel` crate.
>  const __LOG_PREFIX: &[u8] = b"rust_kernel\0";
>
> diff --git a/rust/kernel/pages.rs b/rust/kernel/pages.rs
> new file mode 100644
> index 000000000000..ed51b053dd5d
> --- /dev/null
> +++ b/rust/kernel/pages.rs
> @@ -0,0 +1,284 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +//! Kernel page allocation and management.
> +//!
> +//! This module currently provides limited support. It supports pages of order 0
> +//! for most operations. Page allocation flags are fixed.
> +
> +use crate::{bindings, error::code::*, error::Result, PAGE_SIZE};
> +use core::{marker::PhantomData, ptr};
> +
> +/// A set of physical pages.
> +///
> +/// `Pages` holds a reference to a set of pages of order `ORDER`. Having the order as a generic
> +/// const allows the struct to have the same size as a pointer.

I would remove the 'Having the order as a...' sentence. Since that is
just implementation detail.

> +///
> +/// # Invariants
> +///
> +/// The pointer `Pages::pages` is valid and points to 2^ORDER pages.

`Pages::pages` -> `pages`.

> +pub struct Pages<const ORDER: u32> {
> +    pub(crate) pages: *mut bindings::page,
> +}
> +
> +impl<const ORDER: u32> Pages<ORDER> {
> +    /// Allocates a new set of contiguous pages.
> +    pub fn new() -> Result<Self> {
> +        let pages = unsafe {
> +            bindings::alloc_pages(
> +                bindings::GFP_KERNEL | bindings::__GFP_ZERO | bindings::___GFP_HIGHMEM,
> +                ORDER,
> +            )
> +        };

Missing `SAFETY` comment.

> +        if pages.is_null() {
> +            return Err(ENOMEM);
> +        }
> +        // INVARIANTS: We checked that the allocation above succeeded.
> +        // SAFETY: We allocated pages above
> +        Ok(unsafe { Self::from_raw(pages) })
> +    }
> +
> +    /// Create a `Pages` from a raw `struct page` pointer
> +    ///
> +    /// # Safety
> +    ///
> +    /// Caller must own the pages pointed to by `ptr` as these will be freed
> +    /// when the returned `Pages` is dropped.
> +    pub unsafe fn from_raw(ptr: *mut bindings::page) -> Self {
> +        Self { pages: ptr }
> +    }
> +}
> +
> +impl Pages<0> {
> +    #[inline(always)]

Is this really needed? I think this function should be inlined
automatically.

> +    fn check_offset_and_map<I: MappingInfo>(
> +        &self,
> +        offset: usize,
> +        len: usize,
> +    ) -> Result<PageMapping<'_, I>>
> +    where
> +        Pages<0>: MappingActions<I>,

Why not use `Self: MappingActions<I>`?

> +    {
> +        let end = offset.checked_add(len).ok_or(EINVAL)?;
> +        if end as u32 > PAGE_SIZE {

Remove the `as u32`, since `PAGE_SIZE` should be of type `usize`.

> +            return Err(EINVAL);

I think it would make sense to create a more descriptive Rust error with
a `From` impl to turn it into an `Error`. It always is better to know from
the signature what exactly can go wrong when calling a function.

> +        }
> +
> +        let mapping = <Self as MappingActions<I>>::map(self);
> +
> +        Ok(mapping)

I would merge these lines.

> +    }
> +
> +    #[inline(always)]
> +    unsafe fn read_internal<I: MappingInfo>(

Missing `# Safety` section.

> +        &self,
> +        dest: *mut u8,
> +        offset: usize,
> +        len: usize,
> +    ) -> Result
> +    where
> +        Pages<0>: MappingActions<I>,
> +    {
> +        let mapping = self.check_offset_and_map::<I>(offset, len)?;
> +
> +        unsafe { ptr::copy_nonoverlapping((mapping.ptr as *mut u8).add(offset), dest, len) };

Missing `SAFETY` comment. Replace `as *mut u8` with `.cast::<u8>()`.

> +        Ok(())
> +    }
> +
> +    /// Maps the pages and reads from them into the given buffer.
> +    ///
> +    /// # Safety
> +    ///
> +    /// Callers must ensure that the destination buffer is valid for the given
> +    /// length. Additionally, if the raw buffer is intended to be recast, they
> +    /// must ensure that the data can be safely cast;
> +    /// [`crate::io_buffer::ReadableFromBytes`] has more details about it.
> +    /// `dest` may not point to the source page.

- `dest` is valid for writes for `len`.
- What is meant by 'the raw buffer is intended to be recast'?
- `io_buffer` does not yet exist in `rust-next`.

> +    #[inline(always)]
> +    pub unsafe fn read(&self, dest: *mut u8, offset: usize, len: usize) -> Result {
> +        unsafe { self.read_internal::<NormalMappingInfo>(dest, offset, len) }

Missing `SAFETY` comment.

> +    }
> +
> +    /// Maps the pages and reads from them into the given buffer. The page is
> +    /// mapped atomically.
> +    ///
> +    /// # Safety
> +    ///
> +    /// Callers must ensure that the destination buffer is valid for the given
> +    /// length. Additionally, if the raw buffer is intended to be recast, they
> +    /// must ensure that the data can be safely cast;
> +    /// [`crate::io_buffer::ReadableFromBytes`] has more details about it.
> +    /// `dest` may not point to the source page.
> +    #[inline(always)]
> +    pub unsafe fn read_atomic(&self, dest: *mut u8, offset: usize, len: usize) -> Result {
> +        unsafe { self.read_internal::<AtomicMappingInfo>(dest, offset, len) }

Missing `SAFETY` comment.

> +    }
> +
> +    #[inline(always)]
> +    unsafe fn write_internal<I: MappingInfo>(

Missing `# Safety` section.

> +        &self,
> +        src: *const u8,
> +        offset: usize,
> +        len: usize,
> +    ) -> Result
> +    where
> +        Pages<0>: MappingActions<I>,
> +    {
> +        let mapping = self.check_offset_and_map::<I>(offset, len)?;
> +
> +        unsafe { ptr::copy_nonoverlapping(src, (mapping.ptr as *mut u8).add(offset), len) };

Missing `SAFETY` comment.

> +        Ok(())
> +    }
> +
> +    /// Maps the pages and writes into them from the given buffer.
> +    ///
> +    /// # Safety
> +    ///
> +    /// Callers must ensure that the buffer is valid for the given length.
> +    /// Additionally, if the page is (or will be) mapped by userspace, they must
> +    /// ensure that no kernel data is leaked through padding if it was cast from
> +    /// another type; [`crate::io_buffer::WritableToBytes`] has more details
> +    /// about it. `src` must not point to the destination page.

`src` is valid for reads for `len`.

> +    #[inline(always)]
> +    pub unsafe fn write(&self, src: *const u8, offset: usize, len: usize) -> Result {
> +        unsafe { self.write_internal::<NormalMappingInfo>(src, offset, len) }
> +    }
> +
> +    /// Maps the pages and writes into them from the given buffer. The page is
> +    /// mapped atomically.
> +    ///
> +    /// # Safety
> +    ///
> +    /// Callers must ensure that the buffer is valid for the given length.
> +    /// Additionally, if the page is (or will be) mapped by userspace, they must
> +    /// ensure that no kernel data is leaked through padding if it was cast from
> +    /// another type; [`crate::io_buffer::WritableToBytes`] has more details
> +    /// about it. `src` must not point to the destination page.
> +    #[inline(always)]
> +    pub unsafe fn write_atomic(&self, src: *const u8, offset: usize, len: usize) -> Result {
> +        unsafe { self.write_internal::<AtomicMappingInfo>(src, offset, len) }
> +    }
> +
> +    /// Maps the page at index 0.
> +    #[inline(always)]
> +    pub fn kmap(&self) -> PageMapping<'_, NormalMappingInfo> {
> +        let ptr = unsafe { bindings::kmap(self.pages) };

Missing `SAFETY` comment.

> +
> +        PageMapping {
> +            page: self.pages,
> +            ptr,
> +            _phantom: PhantomData,
> +            _phantom2: PhantomData,
> +        }
> +    }
> +
> +    /// Atomically Maps the page at index 0.
> +    #[inline(always)]
> +    pub fn kmap_atomic(&self) -> PageMapping<'_, AtomicMappingInfo> {
> +        let ptr = unsafe { bindings::kmap_atomic(self.pages) };

Missing `SAFETY` comment.

> +
> +        PageMapping {
> +            page: self.pages,
> +            ptr,
> +            _phantom: PhantomData,
> +            _phantom2: PhantomData,
> +        }
> +    }
> +}
> +
> +impl<const ORDER: u32> Drop for Pages<ORDER> {
> +    fn drop(&mut self) {
> +        // SAFETY: By the type invariants, we know the pages are allocated with the given order.
> +        unsafe { bindings::__free_pages(self.pages, ORDER) };
> +    }
> +}
> +
> +/// Specifies the type of page mapping
> +pub trait MappingInfo {}
> +
> +/// Encapsulates methods to map and unmap pages
> +pub trait MappingActions<I: MappingInfo>
> +where
> +    Pages<0>: MappingActions<I>,
> +{
> +    /// Map a page into the kernel address scpace

Typo.

> +    fn map(pages: &Pages<0>) -> PageMapping<'_, I>;
> +
> +    /// Unmap a page specified by `mapping`
> +    ///
> +    /// # Safety
> +    ///
> +    /// Must only be called by `PageMapping::drop()`.
> +    unsafe fn unmap(mapping: &PageMapping<'_, I>);
> +}
> +
> +/// A type state indicating that pages were mapped atomically
> +pub struct AtomicMappingInfo;
> +impl MappingInfo for AtomicMappingInfo {}
> +
> +/// A type state indicating that pages were not mapped atomically
> +pub struct NormalMappingInfo;
> +impl MappingInfo for NormalMappingInfo {}
> +
> +impl MappingActions<AtomicMappingInfo> for Pages<0> {
> +    #[inline(always)]
> +    fn map(pages: &Pages<0>) -> PageMapping<'_, AtomicMappingInfo> {
> +        pages.kmap_atomic()
> +    }
> +
> +    #[inline(always)]
> +    unsafe fn unmap(mapping: &PageMapping<'_, AtomicMappingInfo>) {
> +        // SAFETY: An instance of `PageMapping` is created only when `kmap` succeeded for the given
> +        // page, so it is safe to unmap it here.
> +        unsafe { bindings::kunmap_atomic(mapping.ptr) };
> +    }
> +}
> +
> +impl MappingActions<NormalMappingInfo> for Pages<0> {
> +    #[inline(always)]
> +    fn map(pages: &Pages<0>) -> PageMapping<'_, NormalMappingInfo> {
> +        pages.kmap()
> +    }
> +
> +    #[inline(always)]
> +    unsafe fn unmap(mapping: &PageMapping<'_, NormalMappingInfo>) {
> +        // SAFETY: An instance of `PageMapping` is created only when `kmap` succeeded for the given
> +        // page, so it is safe to unmap it here.
> +        unsafe { bindings::kunmap(mapping.page) };
> +    }
> +}

I am not sure if this is the best implementation, why do the `kmap` and
`kmap_atomic` functions exist? Would it not make sense to implement
them entirely in `MappingActions::map`?

> +
> +/// An owned page mapping. When this struct is dropped, the page is unmapped.
> +pub struct PageMapping<'a, I: MappingInfo>
> +where
> +    Pages<0>: MappingActions<I>,
> +{
> +    page: *mut bindings::page,
> +    ptr: *mut core::ffi::c_void,
> +    _phantom: PhantomData<&'a i32>,
> +    _phantom2: PhantomData<I>,
> +}
> +
> +impl<'a, I: MappingInfo> PageMapping<'a, I>
> +where
> +    Pages<0>: MappingActions<I>,
> +{
> +    /// Return a pointer to the wrapped `struct page`
> +    #[inline(always)]
> +    pub fn get_ptr(&self) -> *mut core::ffi::c_void {
> +        self.ptr
> +    }
> +}
> +
> +// Because we do not have Drop specialization, we have to do this dance. Life
> +// would be much more simple if we could have `impl Drop for PageMapping<'_,
> +// Atomic>` and `impl Drop for PageMapping<'_, NotAtomic>`
> +impl<I: MappingInfo> Drop for PageMapping<'_, I>
> +where
> +    Pages<0>: MappingActions<I>,
> +{
> +    #[inline(always)]
> +    fn drop(&mut self) {
> +        // SAFETY: We are OK to call this because we are `PageMapping::drop()`
> +        unsafe { <Pages<0> as MappingActions<I>>::unmap(self) }
> +    }
> +}
> --
> 2.40.0

Here are some more general things:
- I think we could use this as an opportunity to add more docs about how
  paging works, or at least add some links to the C documentation.
- Can we improve the paging API? I have not given it any thought yet, but
  the current API looks very primitive.
- Documentation comments should form complete sentences (so end with `.`).

--
Cheers,
Benno


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC PATCH 00/11] Rust null block driver
  2023-05-03 12:29   ` Andreas Hindborg
@ 2023-05-03 13:54     ` Niklas Cassel
  0 siblings, 0 replies; 69+ messages in thread
From: Niklas Cassel @ 2023-05-03 13:54 UTC (permalink / raw)
  To: Andreas Hindborg
  Cc: Jens Axboe, Christoph Hellwig, Keith Busch, Damien Le Moal,
	Hannes Reinecke, lsf-pc, rust-for-linux, linux-block,
	Matthew Wilcox, Miguel Ojeda, Alex Gaynor, Wedson Almeida Filho,
	Boqun Feng, Gary Guo, Björn Roy Baron, Benno Lossin,
	open list, gost.dev

On Wed, May 03, 2023 at 02:29:08PM +0200, Andreas Hindborg wrote:
> 

(cut)

> 
> For iodepth_batch_submit=1, iodepth_batch_complete=1:
> +---------+----------+---------------------+---------------------+
> | jobs/bs | workload |          1          |          6          |
> +---------+----------+---------------------+---------------------+
> |    4k   | randread | 2.97 0.00 (0.9,0.0) | 4.06 0.00 (1.8,0.0) |
> +---------+----------+---------------------+---------------------+
> 
> For iodepth_batch_submit=16, iodepth_batch_complete=16:
> +---------+----------+---------------------+---------------------+
> | jobs/bs | workload |          1          |          6          |
> +---------+----------+---------------------+---------------------+
> |    4k   | randread | 4.40 0.00 (1.1,0.0) | 4.87 0.00 (1.8,0.0) |
> +---------+----------+---------------------+---------------------+
> 
> Above numbers are 60 second runs on bare metal, so not entirely
> comparable with the ones in the cover letter.
> 
> > You might want to explain your table a bit more.
> 
> I understand that the table can be difficult to read. It is not easy to
> convey all this information in ASCII email. The numbers in parenthesis
> in the cells _are_ IOPS x 10e6 (read,write). Referring to he second
> table above, for 1 job at 4k bs the Rust driver performed 4.8 percent
> more IOPS than the C driver. The C driver did 1.1M IOPS. I hope this
> clarifies the table, otherwise let me know!

Yes, I can fully understand that it is hard to convey all the information
in a small text table. I do think it is better to have more small tables.
It will be a lot of text, but perhaps one table per "nr jobs" would have
been clearer.

Perhaps you don't need such fine granularity of "nr jobs" either, perhaps
skip some. Same thing can probably be said about block size increments,
you can probably skip some.

It seems like the numbers are very jittery. I assume that this is because
the drive utilization is so low. The IOPS are also very low.
You probably need to increase the QD for these numbers to be consistent.



Perhaps move this text to be just before the table itself:
"In this table each cell shows the relative performance of the Rust
driver to the C driver with the throughput of the C driver in parenthesis:
`rel_read rel_write (read_miops write_miops)`. Ex: the Rust driver performs 4.74
percent better than the C driver for fio randread with 2 jobs at 16 KiB
block size."

I think it would be good to print the IOPS for both C and Rust, as you now
only seem to print the C driver IOPS. That way it is also easier to verify
the relative differences. (Not that I think that you are making up numbers,
but you usually see IOPS for both implementations, followed by percentage
diffs.

See e.g. how Jens did it:
https://lore.kernel.org/linux-block/20190116175003.17880-1-axboe@kernel.dk/#r


Kind regards,
Niklas

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC PATCH 00/11] Rust null block driver
  2023-05-03  9:06 [RFC PATCH 00/11] Rust null block driver Andreas Hindborg
                   ` (11 preceding siblings ...)
  2023-05-03 11:32 ` [RFC PATCH 00/11] Rust null block driver Niklas Cassel
@ 2023-05-03 16:47 ` Bart Van Assche
  2023-05-04 18:15   ` Andreas Hindborg
  2023-05-07 23:31 ` Luis Chamberlain
                   ` (3 subsequent siblings)
  16 siblings, 1 reply; 69+ messages in thread
From: Bart Van Assche @ 2023-05-03 16:47 UTC (permalink / raw)
  To: Andreas Hindborg, Jens Axboe, Christoph Hellwig, Keith Busch,
	Damien Le Moal, Hannes Reinecke, lsf-pc, rust-for-linux,
	linux-block
  Cc: Andreas Hindborg, Matthew Wilcox, Miguel Ojeda, Alex Gaynor,
	Wedson Almeida Filho, Boqun Feng, Gary Guo, Björn Roy Baron,
	Benno Lossin, open list, gost.dev

On 5/3/23 02:06, Andreas Hindborg wrote:
> This is an early preview of a null block driver written in Rust.

It is not clear to me why this effort was started? As far as I know the 
null_blk driver is only used by kernel developers for testing kernel 
changes so end users are not affected by bugs in this driver. 
Additionally, performance of this driver is critical since this driver 
is used to measure block layer performance. Does this mean that C is a 
better choice than Rust for this driver?

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC PATCH 00/11] Rust null block driver
  2023-05-03 16:47 ` Bart Van Assche
@ 2023-05-04 18:15   ` Andreas Hindborg
  2023-05-04 18:36     ` Bart Van Assche
  0 siblings, 1 reply; 69+ messages in thread
From: Andreas Hindborg @ 2023-05-04 18:15 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, Christoph Hellwig, Keith Busch, Damien Le Moal,
	Hannes Reinecke, lsf-pc, rust-for-linux, linux-block,
	Matthew Wilcox, Miguel Ojeda, Alex Gaynor, Wedson Almeida Filho,
	Boqun Feng, Gary Guo, Björn Roy Baron, Benno Lossin,
	open list, gost.dev


Hi Bart,

Bart Van Assche <bvanassche@acm.org> writes:

> On 5/3/23 02:06, Andreas Hindborg wrote:
>> This is an early preview of a null block driver written in Rust.
>
> It is not clear to me why this effort was started? As far as I know the null_blk
> driver is only used by kernel developers for testing kernel changes so end users
> are not affected by bugs in this driver. Additionally, performance of this
> driver is critical since this driver is used to measure block layer performance.
> Does this mean that C is a better choice than Rust for this driver?

I take it you did not read the rest of the cover letter. Let me quote
some of it here:

> A null block driver is a good opportunity to evaluate Rust bindings for the
> block layer. It is a small and simple driver and thus should be simple to reason
> about. Further, the null block driver is not usually deployed in production
> environments. Thus, it should be fairly straight forward to review, and any
> potential issues are not going to bring down any production workloads.
>
> Being small and simple, the null block driver is a good place to introduce the
> Linux kernel storage community to Rust. This will help prepare the community for
> future Rust projects and facilitate a better maintenance process for these
> projects.
>
> The statistics presented in my previous message [1] show that the C null block
> driver has had a significant amount of memory safety related problems in the
> past. 41% of fixes merged for the C null block driver are fixes for memory
> safety issues. This makes the null block driver a good candidate for rewriting
> in Rust.

In relation to performance, it turns out that there is not much of a
difference. For memory safety bugs - I think we are better off without
them, no matter if we are user facing or not.

If it is still unclear to you why this effort was started, please do let
me know and I shall try to clarify further :)

Best regards,
Andreas

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC PATCH 00/11] Rust null block driver
  2023-05-04 18:15   ` Andreas Hindborg
@ 2023-05-04 18:36     ` Bart Van Assche
  2023-05-04 18:46       ` Andreas Hindborg
                         ` (2 more replies)
  0 siblings, 3 replies; 69+ messages in thread
From: Bart Van Assche @ 2023-05-04 18:36 UTC (permalink / raw)
  To: Andreas Hindborg
  Cc: Jens Axboe, Christoph Hellwig, Keith Busch, Damien Le Moal,
	Hannes Reinecke, lsf-pc, rust-for-linux, linux-block,
	Matthew Wilcox, Miguel Ojeda, Alex Gaynor, Wedson Almeida Filho,
	Boqun Feng, Gary Guo, Björn Roy Baron, Benno Lossin,
	open list, gost.dev

On 5/4/23 11:15, Andreas Hindborg wrote:
> If it is still unclear to you why this effort was started, please do let
> me know and I shall try to clarify further :)

It seems like I was too polite in my previous email. What I meant is that
rewriting code is useful if it provides a clear advantage to the users of
a driver. For null_blk, the users are kernel developers. The code that has
been posted is the start of a rewrite of the null_blk driver. The benefits
of this rewrite (making low-level memory errors less likely) do not outweigh
the risks that this effort will introduce functional or performance regressions.

Bart.


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC PATCH 00/11] Rust null block driver
  2023-05-04 18:36     ` Bart Van Assche
@ 2023-05-04 18:46       ` Andreas Hindborg
  2023-05-04 18:52       ` Keith Busch
  2023-05-05  5:28       ` Hannes Reinecke
  2 siblings, 0 replies; 69+ messages in thread
From: Andreas Hindborg @ 2023-05-04 18:46 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, Christoph Hellwig, Keith Busch, Damien Le Moal,
	Hannes Reinecke, lsf-pc, rust-for-linux, linux-block,
	Matthew Wilcox, Miguel Ojeda, Alex Gaynor, Wedson Almeida Filho,
	Boqun Feng, Gary Guo, Björn Roy Baron, Benno Lossin,
	open list, gost.dev


Bart Van Assche <bvanassche@acm.org> writes:

> On 5/4/23 11:15, Andreas Hindborg wrote:
>> If it is still unclear to you why this effort was started, please do let
>> me know and I shall try to clarify further :)
>
> It seems like I was too polite in my previous email. What I meant is that
> rewriting code is useful if it provides a clear advantage to the users of
> a driver. For null_blk, the users are kernel developers. The code that has
> been posted is the start of a rewrite of the null_blk driver. The benefits
> of this rewrite (making low-level memory errors less likely) do not outweigh
> the risks that this effort will introduce functional or performance regressions.

If this turns in to a full rewrite instead of just a demonstrator, we
will be in the lucky situation that we have the existing C version to
verify performance and functionality against. Unnoticed regressions are
unlikely in this sense.

If we want to have Rust abstractions for the block layer in the kernel
(some people do), then having a simple driver in Rust to regression test
these abstractions with, is good value.

Best regards,
Andreas

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC PATCH 00/11] Rust null block driver
  2023-05-04 18:36     ` Bart Van Assche
  2023-05-04 18:46       ` Andreas Hindborg
@ 2023-05-04 18:52       ` Keith Busch
  2023-05-04 19:02         ` Jens Axboe
  2023-05-05  5:28       ` Hannes Reinecke
  2 siblings, 1 reply; 69+ messages in thread
From: Keith Busch @ 2023-05-04 18:52 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Andreas Hindborg, Jens Axboe, Christoph Hellwig, Damien Le Moal,
	Hannes Reinecke, lsf-pc, rust-for-linux, linux-block,
	Matthew Wilcox, Miguel Ojeda, Alex Gaynor, Wedson Almeida Filho,
	Boqun Feng, Gary Guo, Björn Roy Baron, Benno Lossin,
	open list, gost.dev

On Thu, May 04, 2023 at 11:36:01AM -0700, Bart Van Assche wrote:
> On 5/4/23 11:15, Andreas Hindborg wrote:
> > If it is still unclear to you why this effort was started, please do let
> > me know and I shall try to clarify further :)
> 
> It seems like I was too polite in my previous email. What I meant is that
> rewriting code is useful if it provides a clear advantage to the users of
> a driver. For null_blk, the users are kernel developers. The code that has
> been posted is the start of a rewrite of the null_blk driver. The benefits
> of this rewrite (making low-level memory errors less likely) do not outweigh
> the risks that this effort will introduce functional or performance regressions.

Instead of replacing, would co-existing be okay? Of course as long as
there's no requirement to maintain feature parity between the two.
Actually, just call it "rust_blk" and declare it has no relationship to
null_blk, despite their functional similarities: it's a developer
reference implementation for a rust block driver.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC PATCH 00/11] Rust null block driver
  2023-05-04 18:52       ` Keith Busch
@ 2023-05-04 19:02         ` Jens Axboe
  2023-05-04 19:59           ` Andreas Hindborg
                             ` (3 more replies)
  0 siblings, 4 replies; 69+ messages in thread
From: Jens Axboe @ 2023-05-04 19:02 UTC (permalink / raw)
  To: Keith Busch, Bart Van Assche
  Cc: Andreas Hindborg, Christoph Hellwig, Damien Le Moal,
	Hannes Reinecke, lsf-pc, rust-for-linux, linux-block,
	Matthew Wilcox, Miguel Ojeda, Alex Gaynor, Wedson Almeida Filho,
	Boqun Feng, Gary Guo, Björn Roy Baron, Benno Lossin,
	open list, gost.dev

On 5/4/23 12:52?PM, Keith Busch wrote:
> On Thu, May 04, 2023 at 11:36:01AM -0700, Bart Van Assche wrote:
>> On 5/4/23 11:15, Andreas Hindborg wrote:
>>> If it is still unclear to you why this effort was started, please do let
>>> me know and I shall try to clarify further :)
>>
>> It seems like I was too polite in my previous email. What I meant is that
>> rewriting code is useful if it provides a clear advantage to the users of
>> a driver. For null_blk, the users are kernel developers. The code that has
>> been posted is the start of a rewrite of the null_blk driver. The benefits
>> of this rewrite (making low-level memory errors less likely) do not outweigh
>> the risks that this effort will introduce functional or performance regressions.
> 
> Instead of replacing, would co-existing be okay? Of course as long as
> there's no requirement to maintain feature parity between the two.
> Actually, just call it "rust_blk" and declare it has no relationship to
> null_blk, despite their functional similarities: it's a developer
> reference implementation for a rust block driver.

To me, the big discussion point isn't really whether we're doing
null_blk or not, it's more if we want to go down this path of
maintaining rust bindings for the block code in general. If the answer
to that is yes, then doing null_blk seems like a great choice as it's
not a critical piece of infrastructure. It might even be a good idea to
be able to run both, for performance purposes, as the bindings or core
changes.

But back to the real question... This is obviously extra burden on
maintainers, and that needs to be sorted out first. Block drivers in
general are not super security sensitive, as it's mostly privileged code
and there's not a whole lot of user visibile API. And the stuff we do
have is reasonably basic. So what's the long term win of having rust
bindings? This is a legitimate question. I can see a lot of other more
user exposed subsystems being of higher interest here.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC PATCH 00/11] Rust null block driver
  2023-05-04 19:02         ` Jens Axboe
@ 2023-05-04 19:59           ` Andreas Hindborg
  2023-05-04 20:55             ` Jens Axboe
  2023-05-04 20:11           ` Miguel Ojeda
                             ` (2 subsequent siblings)
  3 siblings, 1 reply; 69+ messages in thread
From: Andreas Hindborg @ 2023-05-04 19:59 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Keith Busch, Bart Van Assche, Christoph Hellwig, Damien Le Moal,
	Hannes Reinecke, lsf-pc, rust-for-linux, linux-block,
	Matthew Wilcox, Miguel Ojeda, Alex Gaynor, Wedson Almeida Filho,
	Boqun Feng, Gary Guo, Björn Roy Baron, Benno Lossin,
	open list, gost.dev


Jens Axboe <axboe@kernel.dk> writes:

> On 5/4/23 12:52?PM, Keith Busch wrote:
>> On Thu, May 04, 2023 at 11:36:01AM -0700, Bart Van Assche wrote:
>>> On 5/4/23 11:15, Andreas Hindborg wrote:
>>>> If it is still unclear to you why this effort was started, please do let
>>>> me know and I shall try to clarify further :)
>>>
>>> It seems like I was too polite in my previous email. What I meant is that
>>> rewriting code is useful if it provides a clear advantage to the users of
>>> a driver. For null_blk, the users are kernel developers. The code that has
>>> been posted is the start of a rewrite of the null_blk driver. The benefits
>>> of this rewrite (making low-level memory errors less likely) do not outweigh
>>> the risks that this effort will introduce functional or performance regressions.
>> 
>> Instead of replacing, would co-existing be okay? Of course as long as
>> there's no requirement to maintain feature parity between the two.
>> Actually, just call it "rust_blk" and declare it has no relationship to
>> null_blk, despite their functional similarities: it's a developer
>> reference implementation for a rust block driver.
>
> To me, the big discussion point isn't really whether we're doing
> null_blk or not, it's more if we want to go down this path of
> maintaining rust bindings for the block code in general. If the answer
> to that is yes, then doing null_blk seems like a great choice as it's
> not a critical piece of infrastructure. It might even be a good idea to
> be able to run both, for performance purposes, as the bindings or core
> changes.
>
> But back to the real question... This is obviously extra burden on
> maintainers, and that needs to be sorted out first. Block drivers in
> general are not super security sensitive, as it's mostly privileged code
> and there's not a whole lot of user visibile API. And the stuff we do
> have is reasonably basic. So what's the long term win of having rust
> bindings? This is a legitimate question. I can see a lot of other more
> user exposed subsystems being of higher interest here.

Even though the block layer is not usually exposed in the same way that
something like the USB stack is, absence of memory safety bugs is a very
useful property. If this is attainable without sacrificing performance,
it seems like a nice option to offer future block device driver
developers. Some would argue that it is worth offering even in the face
of performance regression.

While memory safety is the primary feature that Rust brings to the
table, it does come with other nice features as well. The type system,
language support stackless coroutines and error handling language
support are all very useful.

Regarding maintenance of the bindings, it _is_ an amount extra work. But
there is more than one way to structure that work. If Rust is accepted
into the block layer at some point, maintenance could be structured in
such a way that it does not get in the way of existing C maintenance
work. A "rust keeps up or it breaks" model. That could work for a while.

Best regards
Andreas

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC PATCH 00/11] Rust null block driver
  2023-05-04 19:02         ` Jens Axboe
  2023-05-04 19:59           ` Andreas Hindborg
@ 2023-05-04 20:11           ` Miguel Ojeda
  2023-05-04 20:22             ` Jens Axboe
  2023-05-05  3:52           ` Christoph Hellwig
  2023-06-06 13:33           ` Andreas Hindborg (Samsung)
  3 siblings, 1 reply; 69+ messages in thread
From: Miguel Ojeda @ 2023-05-04 20:11 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Keith Busch, Bart Van Assche, Andreas Hindborg,
	Christoph Hellwig, Damien Le Moal, Hannes Reinecke, lsf-pc,
	rust-for-linux, linux-block, Matthew Wilcox, Miguel Ojeda,
	Alex Gaynor, Wedson Almeida Filho, Boqun Feng, Gary Guo,
	Björn Roy Baron, Benno Lossin, open list, gost.dev,
	Daniel Vetter

On Thu, May 4, 2023 at 9:02 PM Jens Axboe <axboe@kernel.dk> wrote:
>
> But back to the real question... This is obviously extra burden on
> maintainers, and that needs to be sorted out first. Block drivers in

Regarding maintenance, something we have suggested in similar cases to
other subsystems is that the author gets involved as a maintainer of,
at least, the Rust abstractions/driver (possibly with a different
`MAINTAINERS` entry).

Of course, that is still work for the existing maintainer(s), i.e.
you, since coordination takes time. However, it can also be a nice way
to learn Rust on the side, meanwhile things are getting upstreamed and
discussed (I think Daniel, in Cc, is taking that approach).

And it may also be a way for you to get an extra
maintainer/reviewer/... later on for the C parts, too, even if Rust
does not succeed.

> general are not super security sensitive, as it's mostly privileged code
> and there's not a whole lot of user visibile API. And the stuff we do
> have is reasonably basic. So what's the long term win of having rust
> bindings? This is a legitimate question. I can see a lot of other more
> user exposed subsystems being of higher interest here.

From the experience of other kernel maintainers/developers that are
making the move, the advantages seem to be well worth it, even
disregarding the security aspect, i.e. on the language side alone.

Cheers,
Miguel

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC PATCH 00/11] Rust null block driver
  2023-05-04 20:11           ` Miguel Ojeda
@ 2023-05-04 20:22             ` Jens Axboe
  2023-05-05 10:53               ` Miguel Ojeda
  0 siblings, 1 reply; 69+ messages in thread
From: Jens Axboe @ 2023-05-04 20:22 UTC (permalink / raw)
  To: Miguel Ojeda
  Cc: Keith Busch, Bart Van Assche, Andreas Hindborg,
	Christoph Hellwig, Damien Le Moal, Hannes Reinecke, lsf-pc,
	rust-for-linux, linux-block, Matthew Wilcox, Miguel Ojeda,
	Alex Gaynor, Wedson Almeida Filho, Boqun Feng, Gary Guo,
	Björn Roy Baron, Benno Lossin, open list, gost.dev,
	Daniel Vetter

On 5/4/23 2:11?PM, Miguel Ojeda wrote:
> On Thu, May 4, 2023 at 9:02?PM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> But back to the real question... This is obviously extra burden on
>> maintainers, and that needs to be sorted out first. Block drivers in
> 
> Regarding maintenance, something we have suggested in similar cases to
> other subsystems is that the author gets involved as a maintainer of,
> at least, the Rust abstractions/driver (possibly with a different
> `MAINTAINERS` entry).

Right, but that doesn't really solve the problem when the rust bindings
get in the way of changes that you are currently making. Or if you break
them inadvertently. I do see benefits to that approach, but it's no
panacea.

> Of course, that is still work for the existing maintainer(s), i.e.
> you, since coordination takes time. However, it can also be a nice way
> to learn Rust on the side, meanwhile things are getting upstreamed and
> discussed (I think Daniel, in Cc, is taking that approach).

This seems to assume that time is plentiful and we can just add more to
our plate, which isn't always true. While I'd love to do more rust and
get more familiar with it, the time still has to be there for that. I'm
actually typing this on a laptop with a rust gpu driver :-)

And this isn't just on me, there are other regular contributors and
reviewers that would need to be onboard with this.

> And it may also be a way for you to get an extra
> maintainer/reviewer/... later on for the C parts, too, even if Rust
> does not succeed.

That is certainly a win!

>> general are not super security sensitive, as it's mostly privileged code
>> and there's not a whole lot of user visibile API. And the stuff we do
>> have is reasonably basic. So what's the long term win of having rust
>> bindings? This is a legitimate question. I can see a lot of other more
>> user exposed subsystems being of higher interest here.
> 
> From the experience of other kernel maintainers/developers that are
> making the move, the advantages seem to be well worth it, even
> disregarding the security aspect, i.e. on the language side alone.

Each case is different though, different people and different schedules
and priorities. So while the above is promising, it's also just
annecdotal and doesn't necessarily apply to our case.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC PATCH 00/11] Rust null block driver
  2023-05-04 19:59           ` Andreas Hindborg
@ 2023-05-04 20:55             ` Jens Axboe
  2023-05-05  5:06               ` Andreas Hindborg
  2023-05-05 11:14               ` Miguel Ojeda
  0 siblings, 2 replies; 69+ messages in thread
From: Jens Axboe @ 2023-05-04 20:55 UTC (permalink / raw)
  To: Andreas Hindborg
  Cc: Keith Busch, Bart Van Assche, Christoph Hellwig, Damien Le Moal,
	Hannes Reinecke, lsf-pc, rust-for-linux, linux-block,
	Matthew Wilcox, Miguel Ojeda, Alex Gaynor, Wedson Almeida Filho,
	Boqun Feng, Gary Guo, Björn Roy Baron, Benno Lossin,
	open list, gost.dev

On 5/4/23 1:59?PM, Andreas Hindborg wrote:
> 
> Jens Axboe <axboe@kernel.dk> writes:
> 
>> On 5/4/23 12:52?PM, Keith Busch wrote:
>>> On Thu, May 04, 2023 at 11:36:01AM -0700, Bart Van Assche wrote:
>>>> On 5/4/23 11:15, Andreas Hindborg wrote:
>>>>> If it is still unclear to you why this effort was started, please do let
>>>>> me know and I shall try to clarify further :)
>>>>
>>>> It seems like I was too polite in my previous email. What I meant is that
>>>> rewriting code is useful if it provides a clear advantage to the users of
>>>> a driver. For null_blk, the users are kernel developers. The code that has
>>>> been posted is the start of a rewrite of the null_blk driver. The benefits
>>>> of this rewrite (making low-level memory errors less likely) do not outweigh
>>>> the risks that this effort will introduce functional or performance regressions.
>>>
>>> Instead of replacing, would co-existing be okay? Of course as long as
>>> there's no requirement to maintain feature parity between the two.
>>> Actually, just call it "rust_blk" and declare it has no relationship to
>>> null_blk, despite their functional similarities: it's a developer
>>> reference implementation for a rust block driver.
>>
>> To me, the big discussion point isn't really whether we're doing
>> null_blk or not, it's more if we want to go down this path of
>> maintaining rust bindings for the block code in general. If the answer
>> to that is yes, then doing null_blk seems like a great choice as it's
>> not a critical piece of infrastructure. It might even be a good idea to
>> be able to run both, for performance purposes, as the bindings or core
>> changes.
>>
>> But back to the real question... This is obviously extra burden on
>> maintainers, and that needs to be sorted out first. Block drivers in
>> general are not super security sensitive, as it's mostly privileged code
>> and there's not a whole lot of user visibile API. And the stuff we do
>> have is reasonably basic. So what's the long term win of having rust
>> bindings? This is a legitimate question. I can see a lot of other more
>> user exposed subsystems being of higher interest here.
> 
> Even though the block layer is not usually exposed in the same way
> that something like the USB stack is, absence of memory safety bugs is
> a very useful property. If this is attainable without sacrificing
> performance, it seems like a nice option to offer future block device
> driver developers. Some would argue that it is worth offering even in
> the face of performance regression.
> 
> While memory safety is the primary feature that Rust brings to the
> table, it does come with other nice features as well. The type system,
> language support stackless coroutines and error handling language
> support are all very useful.

We're in violent agreement on this part, I don't think anyone sane would
argue that memory safety with the same performance [1] isn't something
you'd want. And the error handling with rust is so much better than the
C stuff drivers do now that I can't see anyone disagreeing on that being
a great thing as well.

The discussion point here is the price being paid in terms of people
time.

> Regarding maintenance of the bindings, it _is_ an amount extra work. But
> there is more than one way to structure that work. If Rust is accepted
> into the block layer at some point, maintenance could be structured in
> such a way that it does not get in the way of existing C maintenance
> work. A "rust keeps up or it breaks" model. That could work for a while.

That potentially works for null_blk, but it would not work for anything
that people actually depend on. In other words, anything that isn't
null_blk. And I don't believe we'd be actively discussing these bindings
if just doing null_blk is the end goal, because that isn't useful by
itself, and at that point we'd all just be wasting our time. In the real
world, once we have just one actual driver using it, then we'd be
looking at "this driver regressed because of change X/Y/Z and that needs
to get sorted before the next release". And THAT is the real issue for
me. So a "rust keeps up or it breaks" model is a bit naive in my
opinion, it's just not a viable approach. In fact, even for null_blk,
this doesn't really fly as we rely on blktests to continually vet the
sanity of the IO stack, and null_blk is an integral part of that.

So I really don't think there's much to debate between "rust people vs
jens" here, as we agree on the benefits, but my end of the table has to
stomach the cons. And like I mentioned in an earlier email, that's not
just on me, there are other regular contributors and reviewers that are
relevant to this discussion. This is something we need to discuss.

[1] We obviously need to do real numbers here, the ones posted I don't
consider stable enough to be useful in saying "yeah it's fully on part".
If you have an updated rust nvme driver that uses these bindings I'd
be happy to run some testing that will definitively tell us if there's a
performance win, loss, or parity, and how much.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC PATCH 00/11] Rust null block driver
  2023-05-04 19:02         ` Jens Axboe
  2023-05-04 19:59           ` Andreas Hindborg
  2023-05-04 20:11           ` Miguel Ojeda
@ 2023-05-05  3:52           ` Christoph Hellwig
  2023-06-06 13:33           ` Andreas Hindborg (Samsung)
  3 siblings, 0 replies; 69+ messages in thread
From: Christoph Hellwig @ 2023-05-05  3:52 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Keith Busch, Bart Van Assche, Andreas Hindborg,
	Christoph Hellwig, Damien Le Moal, Hannes Reinecke, lsf-pc,
	rust-for-linux, linux-block, Matthew Wilcox, Miguel Ojeda,
	Alex Gaynor, Wedson Almeida Filho, Boqun Feng, Gary Guo,
	Björn Roy Baron, Benno Lossin, open list, gost.dev

On Thu, May 04, 2023 at 01:02:19PM -0600, Jens Axboe wrote:
> null_blk or not, it's more if we want to go down this path of
> maintaining rust bindings for the block code in general. If the answer
> to that is yes, then doing null_blk seems like a great choice as it's
> not a critical piece of infrastructure. It might even be a good idea to
> be able to run both, for performance purposes, as the bindings or core
> changes.

Yes.  And I'm not in favor of it, especially right now.  There is
so much work we need to do that requires changes all over (e.g.
sorting out the request_queue vs gendisk, and making the bio_vec
folio or even better physical address based), and the last thing I
need is a binding to a another language, one that happens to have
nice features but that also is really ugly.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC PATCH 01/11] rust: add radix tree abstraction
  2023-05-03  9:06 ` [RFC PATCH 01/11] rust: add radix tree abstraction Andreas Hindborg
  2023-05-03 10:34   ` Benno Lossin
@ 2023-05-05  4:04   ` Matthew Wilcox
  2023-05-05  4:49     ` Andreas Hindborg
  1 sibling, 1 reply; 69+ messages in thread
From: Matthew Wilcox @ 2023-05-05  4:04 UTC (permalink / raw)
  To: Andreas Hindborg
  Cc: Jens Axboe, Christoph Hellwig, Keith Busch, Damien Le Moal,
	Hannes Reinecke, lsf-pc, rust-for-linux, linux-block,
	Andreas Hindborg, Miguel Ojeda, Alex Gaynor,
	Wedson Almeida Filho, Boqun Feng, Gary Guo, Björn Roy Baron,
	Benno Lossin, open list, gost.dev

On Wed, May 03, 2023 at 11:06:58AM +0200, Andreas Hindborg wrote:
> From: Andreas Hindborg <a.hindborg@samsung.com>
> 
> Add abstractions for the C radix_tree. This abstraction allows Rust code to use
> the radix_tree as a map from `u64` keys to `ForeignOwnable` values.

Please, no.  The XArray interface is the preferred one; the radix tree
is legacy.  Don't make Rust code use the radix tree.  It has the GFP
arguments in the wrong place, for one thing.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC PATCH 02/11] rust: add `pages` module for handling page allocation
  2023-05-03  9:06 ` [RFC PATCH 02/11] rust: add `pages` module for handling page allocation Andreas Hindborg
  2023-05-03 12:31   ` Benno Lossin
@ 2023-05-05  4:09   ` Matthew Wilcox
  2023-05-05  4:42     ` Andreas Hindborg
  1 sibling, 1 reply; 69+ messages in thread
From: Matthew Wilcox @ 2023-05-05  4:09 UTC (permalink / raw)
  To: Andreas Hindborg
  Cc: Jens Axboe, Christoph Hellwig, Keith Busch, Damien Le Moal,
	Hannes Reinecke, lsf-pc, rust-for-linux, linux-block,
	Andreas Hindborg, Miguel Ojeda, Alex Gaynor,
	Wedson Almeida Filho, Boqun Feng, Gary Guo, Björn Roy Baron,
	Benno Lossin, open list, gost.dev

On Wed, May 03, 2023 at 11:06:59AM +0200, Andreas Hindborg wrote:
> From: Andreas Hindborg <a.hindborg@samsung.com>
> 
> This patch adds support for working with pages of order 0. Support for pages
> with higher order is deferred. Page allocation flags are fixed in this patch.
> Future work might allow the user to specify allocation flags.
> 
> This patch is a heavily modified version of code available in the rust tree [1],
> primarily adding support for multiple page mapping strategies.

This also seems misaligned with the direction of Linux development.
Folios are the future, pages are legacy.  Please, ask about what's
going on before wasting time on the past.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC PATCH 02/11] rust: add `pages` module for handling page allocation
  2023-05-05  4:09   ` Matthew Wilcox
@ 2023-05-05  4:42     ` Andreas Hindborg
  2023-05-05  5:29       ` Matthew Wilcox
  0 siblings, 1 reply; 69+ messages in thread
From: Andreas Hindborg @ 2023-05-05  4:42 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Jens Axboe, Christoph Hellwig, Keith Busch, Damien Le Moal,
	Hannes Reinecke, lsf-pc, rust-for-linux, linux-block,
	Miguel Ojeda, Alex Gaynor, Wedson Almeida Filho, Boqun Feng,
	Gary Guo, Björn Roy Baron, Benno Lossin, open list,
	gost.dev


Matthew Wilcox <willy@infradead.org> writes:

> On Wed, May 03, 2023 at 11:06:59AM +0200, Andreas Hindborg wrote:
>> From: Andreas Hindborg <a.hindborg@samsung.com>
>> 
>> This patch adds support for working with pages of order 0. Support for pages
>> with higher order is deferred. Page allocation flags are fixed in this patch.
>> Future work might allow the user to specify allocation flags.
>> 
>> This patch is a heavily modified version of code available in the rust tree [1],
>> primarily adding support for multiple page mapping strategies.
>
> This also seems misaligned with the direction of Linux development.
> Folios are the future, pages are legacy.  Please, ask about what's
> going on before wasting time on the past.

I see, thanks for the heads up! In this case I wanted to do an
apples-apples comparison to the C null_blk driver. Since that is using
kmap I wanted to have that. But let's bind to the folio_* APIs in the
future, that would make sense.

Best regards
Andreas


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC PATCH 01/11] rust: add radix tree abstraction
  2023-05-05  4:04   ` Matthew Wilcox
@ 2023-05-05  4:49     ` Andreas Hindborg
  2023-05-05  5:28       ` Matthew Wilcox
  0 siblings, 1 reply; 69+ messages in thread
From: Andreas Hindborg @ 2023-05-05  4:49 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Jens Axboe, Christoph Hellwig, Keith Busch, Damien Le Moal,
	Hannes Reinecke, lsf-pc, rust-for-linux, linux-block,
	Miguel Ojeda, Alex Gaynor, Wedson Almeida Filho, Boqun Feng,
	Gary Guo, Björn Roy Baron, Benno Lossin, open list,
	gost.dev


Matthew Wilcox <willy@infradead.org> writes:

> On Wed, May 03, 2023 at 11:06:58AM +0200, Andreas Hindborg wrote:
>> From: Andreas Hindborg <a.hindborg@samsung.com>
>> 
>> Add abstractions for the C radix_tree. This abstraction allows Rust code to use
>> the radix_tree as a map from `u64` keys to `ForeignOwnable` values.
>
> Please, no.  The XArray interface is the preferred one; the radix tree
> is legacy.  Don't make Rust code use the radix tree.  It has the GFP
> arguments in the wrong place, for one thing.

I have a similar argument to not using xarrray as to not using folios,
see my other response. But the Rust xarray API is in the works [1].

Best regards,
Andreas

[1] https://lore.kernel.org/rust-for-linux/1bb98ef1-727a-45d6-3cf6-39765fe99c5c@asahilina.net/

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC PATCH 00/11] Rust null block driver
  2023-05-04 20:55             ` Jens Axboe
@ 2023-05-05  5:06               ` Andreas Hindborg
  2023-05-05 11:14               ` Miguel Ojeda
  1 sibling, 0 replies; 69+ messages in thread
From: Andreas Hindborg @ 2023-05-05  5:06 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Keith Busch, Bart Van Assche, Christoph Hellwig, Damien Le Moal,
	Hannes Reinecke, lsf-pc, rust-for-linux, linux-block,
	Matthew Wilcox, Miguel Ojeda, Alex Gaynor, Wedson Almeida Filho,
	Boqun Feng, Gary Guo, Björn Roy Baron, Benno Lossin,
	open list, gost.dev


Jens Axboe <axboe@kernel.dk> writes:

> On 5/4/23 1:59?PM, Andreas Hindborg wrote:
>> 
>> Jens Axboe <axboe@kernel.dk> writes:
>> 
>>> On 5/4/23 12:52?PM, Keith Busch wrote:
>>>> On Thu, May 04, 2023 at 11:36:01AM -0700, Bart Van Assche wrote:
>>>>> On 5/4/23 11:15, Andreas Hindborg wrote:
>>>>>> If it is still unclear to you why this effort was started, please do let
>>>>>> me know and I shall try to clarify further :)
>>>>>
>>>>> It seems like I was too polite in my previous email. What I meant is that
>>>>> rewriting code is useful if it provides a clear advantage to the users of
>>>>> a driver. For null_blk, the users are kernel developers. The code that has
>>>>> been posted is the start of a rewrite of the null_blk driver. The benefits
>>>>> of this rewrite (making low-level memory errors less likely) do not outweigh
>>>>> the risks that this effort will introduce functional or performance regressions.
>>>>
>>>> Instead of replacing, would co-existing be okay? Of course as long as
>>>> there's no requirement to maintain feature parity between the two.
>>>> Actually, just call it "rust_blk" and declare it has no relationship to
>>>> null_blk, despite their functional similarities: it's a developer
>>>> reference implementation for a rust block driver.
>>>
>>> To me, the big discussion point isn't really whether we're doing
>>> null_blk or not, it's more if we want to go down this path of
>>> maintaining rust bindings for the block code in general. If the answer
>>> to that is yes, then doing null_blk seems like a great choice as it's
>>> not a critical piece of infrastructure. It might even be a good idea to
>>> be able to run both, for performance purposes, as the bindings or core
>>> changes.
>>>
>>> But back to the real question... This is obviously extra burden on
>>> maintainers, and that needs to be sorted out first. Block drivers in
>>> general are not super security sensitive, as it's mostly privileged code
>>> and there's not a whole lot of user visibile API. And the stuff we do
>>> have is reasonably basic. So what's the long term win of having rust
>>> bindings? This is a legitimate question. I can see a lot of other more
>>> user exposed subsystems being of higher interest here.
>> 
>> Even though the block layer is not usually exposed in the same way
>> that something like the USB stack is, absence of memory safety bugs is
>> a very useful property. If this is attainable without sacrificing
>> performance, it seems like a nice option to offer future block device
>> driver developers. Some would argue that it is worth offering even in
>> the face of performance regression.
>> 
>> While memory safety is the primary feature that Rust brings to the
>> table, it does come with other nice features as well. The type system,
>> language support stackless coroutines and error handling language
>> support are all very useful.
>
> We're in violent agreement on this part, I don't think anyone sane would
> argue that memory safety with the same performance [1] isn't something
> you'd want. And the error handling with rust is so much better than the
> C stuff drivers do now that I can't see anyone disagreeing on that being
> a great thing as well.
>
> The discussion point here is the price being paid in terms of people
> time.
>
>> Regarding maintenance of the bindings, it _is_ an amount extra work. But
>> there is more than one way to structure that work. If Rust is accepted
>> into the block layer at some point, maintenance could be structured in
>> such a way that it does not get in the way of existing C maintenance
>> work. A "rust keeps up or it breaks" model. That could work for a while.
>
> That potentially works for null_blk, but it would not work for anything
> that people actually depend on. In other words, anything that isn't
> null_blk. And I don't believe we'd be actively discussing these bindings
> if just doing null_blk is the end goal, because that isn't useful by
> itself, and at that point we'd all just be wasting our time. In the real
> world, once we have just one actual driver using it, then we'd be
> looking at "this driver regressed because of change X/Y/Z and that needs
> to get sorted before the next release". And THAT is the real issue for
> me. So a "rust keeps up or it breaks" model is a bit naive in my
> opinion, it's just not a viable approach. In fact, even for null_blk,
> this doesn't really fly as we rely on blktests to continually vet the
> sanity of the IO stack, and null_blk is an integral part of that.

Sure, once there are actual users, this model would not work. But during
an introduction period it might be a useful model. Having Rust around
without having to take care of it might give
maintainers,reviewers,contributors a no strings attached opportunity to
dabble with the language in a domain they are familiar with.

>
> So I really don't think there's much to debate between "rust people vs
> jens" here, as we agree on the benefits, but my end of the table has to
> stomach the cons. And like I mentioned in an earlier email, that's not
> just on me, there are other regular contributors and reviewers that are
> relevant to this discussion. This is something we need to discuss.
>
> [1] We obviously need to do real numbers here, the ones posted I don't
> consider stable enough to be useful in saying "yeah it's fully on part".
> If you have an updated rust nvme driver that uses these bindings I'd
> be happy to run some testing that will definitively tell us if there's a
> performance win, loss, or parity, and how much.

I do plan to rebase the NVMe driver somewhere in the next few months.
I'll let you know when that work is done.

Best regards
Andreas

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC PATCH 01/11] rust: add radix tree abstraction
  2023-05-05  4:49     ` Andreas Hindborg
@ 2023-05-05  5:28       ` Matthew Wilcox
  2023-05-05  6:09         ` Christoph Hellwig
  0 siblings, 1 reply; 69+ messages in thread
From: Matthew Wilcox @ 2023-05-05  5:28 UTC (permalink / raw)
  To: Andreas Hindborg
  Cc: Jens Axboe, Christoph Hellwig, Keith Busch, Damien Le Moal,
	Hannes Reinecke, lsf-pc, rust-for-linux, linux-block,
	Miguel Ojeda, Alex Gaynor, Wedson Almeida Filho, Boqun Feng,
	Gary Guo, Björn Roy Baron, Benno Lossin, open list,
	gost.dev

On Fri, May 05, 2023 at 06:49:49AM +0200, Andreas Hindborg wrote:
> 
> Matthew Wilcox <willy@infradead.org> writes:
> 
> > On Wed, May 03, 2023 at 11:06:58AM +0200, Andreas Hindborg wrote:
> >> From: Andreas Hindborg <a.hindborg@samsung.com>
> >> 
> >> Add abstractions for the C radix_tree. This abstraction allows Rust code to use
> >> the radix_tree as a map from `u64` keys to `ForeignOwnable` values.
> >
> > Please, no.  The XArray interface is the preferred one; the radix tree
> > is legacy.  Don't make Rust code use the radix tree.  It has the GFP
> > arguments in the wrong place, for one thing.
> 
> I have a similar argument to not using xarrray as to not using folios,
> see my other response. But the Rust xarray API is in the works [1].

But the radix tree API is simply a different (and worse) API to the
exact same data structure as the XArray.  Using the XArray instead of
the radix tree is still an apples-to-apples comparison.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC PATCH 00/11] Rust null block driver
  2023-05-04 18:36     ` Bart Van Assche
  2023-05-04 18:46       ` Andreas Hindborg
  2023-05-04 18:52       ` Keith Busch
@ 2023-05-05  5:28       ` Hannes Reinecke
  2 siblings, 0 replies; 69+ messages in thread
From: Hannes Reinecke @ 2023-05-05  5:28 UTC (permalink / raw)
  To: Bart Van Assche, Andreas Hindborg
  Cc: Jens Axboe, Christoph Hellwig, Keith Busch, Damien Le Moal,
	lsf-pc, rust-for-linux, linux-block, Matthew Wilcox,
	Miguel Ojeda, Alex Gaynor, Wedson Almeida Filho, Boqun Feng,
	Gary Guo, Björn Roy Baron, Benno Lossin, open list,
	gost.dev

On 5/4/23 20:36, Bart Van Assche wrote:
> On 5/4/23 11:15, Andreas Hindborg wrote:
>> If it is still unclear to you why this effort was started, please do let
>> me know and I shall try to clarify further :)
> 
> It seems like I was too polite in my previous email. What I meant is that
> rewriting code is useful if it provides a clear advantage to the users of
> a driver. For null_blk, the users are kernel developers. The code that has
> been posted is the start of a rewrite of the null_blk driver. The benefits
> of this rewrite (making low-level memory errors less likely) do not 
> outweigh
> the risks that this effort will introduce functional or performance 
> regressions.
> 
I have to disagree here. While the null_blk driver in itself is 
certainly not _that_ useful, it does provide a good sounding board if 
all the design principles of the linux block layer can be adequately 
expressed in Rust.

And by posting this driver you just proved that, and we all have a 
better understanding what would be needed to convert old or create new 
drivers.

But I guess we'll have a longer discussion at LSF :-)

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew
Myers, Andrew McDonald, Martje Boudien Moerman


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC PATCH 02/11] rust: add `pages` module for handling page allocation
  2023-05-05  4:42     ` Andreas Hindborg
@ 2023-05-05  5:29       ` Matthew Wilcox
  0 siblings, 0 replies; 69+ messages in thread
From: Matthew Wilcox @ 2023-05-05  5:29 UTC (permalink / raw)
  To: Andreas Hindborg
  Cc: Jens Axboe, Christoph Hellwig, Keith Busch, Damien Le Moal,
	Hannes Reinecke, lsf-pc, rust-for-linux, linux-block,
	Miguel Ojeda, Alex Gaynor, Wedson Almeida Filho, Boqun Feng,
	Gary Guo, Björn Roy Baron, Benno Lossin, open list,
	gost.dev

On Fri, May 05, 2023 at 06:42:02AM +0200, Andreas Hindborg wrote:
> 
> Matthew Wilcox <willy@infradead.org> writes:
> 
> > On Wed, May 03, 2023 at 11:06:59AM +0200, Andreas Hindborg wrote:
> >> From: Andreas Hindborg <a.hindborg@samsung.com>
> >> 
> >> This patch adds support for working with pages of order 0. Support for pages
> >> with higher order is deferred. Page allocation flags are fixed in this patch.
> >> Future work might allow the user to specify allocation flags.
> >> 
> >> This patch is a heavily modified version of code available in the rust tree [1],
> >> primarily adding support for multiple page mapping strategies.
> >
> > This also seems misaligned with the direction of Linux development.
> > Folios are the future, pages are legacy.  Please, ask about what's
> > going on before wasting time on the past.
> 
> I see, thanks for the heads up! In this case I wanted to do an
> apples-apples comparison to the C null_blk driver. Since that is using
> kmap I wanted to have that. But let's bind to the folio_* APIs in the
> future, that would make sense.

Well, kmap() is essentially a no-op on 64-bit systems, so it's not
terribly relevant to doing a comparison.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC PATCH 01/11] rust: add radix tree abstraction
  2023-05-05  5:28       ` Matthew Wilcox
@ 2023-05-05  6:09         ` Christoph Hellwig
  2023-05-05  8:33           ` Chaitanya Kulkarni
  0 siblings, 1 reply; 69+ messages in thread
From: Christoph Hellwig @ 2023-05-05  6:09 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andreas Hindborg, Jens Axboe, Christoph Hellwig, Keith Busch,
	Damien Le Moal, Hannes Reinecke, lsf-pc, rust-for-linux,
	linux-block, Miguel Ojeda, Alex Gaynor, Wedson Almeida Filho,
	Boqun Feng, Gary Guo, Björn Roy Baron, Benno Lossin,
	open list, gost.dev

On Fri, May 05, 2023 at 06:28:04AM +0100, Matthew Wilcox wrote:
> But the radix tree API is simply a different (and worse) API to the
> exact same data structure as the XArray.  Using the XArray instead of
> the radix tree is still an apples-to-apples comparison.

And if he wants to do a 1:1 comparism, just convert null_blk to
xarray, which would actually be more useful work than this whole
thread..

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC PATCH 01/11] rust: add radix tree abstraction
  2023-05-05  6:09         ` Christoph Hellwig
@ 2023-05-05  8:33           ` Chaitanya Kulkarni
  0 siblings, 0 replies; 69+ messages in thread
From: Chaitanya Kulkarni @ 2023-05-05  8:33 UTC (permalink / raw)
  To: Christoph Hellwig, Matthew Wilcox
  Cc: Andreas Hindborg, Jens Axboe, Keith Busch, Damien Le Moal,
	Hannes Reinecke, lsf-pc, rust-for-linux, linux-block,
	Miguel Ojeda, Alex Gaynor, Wedson Almeida Filho, Boqun Feng,
	Gary Guo, Björn Roy Baron, Benno Lossin, open list,
	gost.dev

On 5/4/23 23:09, Christoph Hellwig wrote:
> On Fri, May 05, 2023 at 06:28:04AM +0100, Matthew Wilcox wrote:
>> But the radix tree API is simply a different (and worse) API to the
>> exact same data structure as the XArray.  Using the XArray instead of
>> the radix tree is still an apples-to-apples comparison.
> And if he wants to do a 1:1 comparism, just convert null_blk to
> xarray, which would actually be more useful work than this whole
> thread..

please don't submit patches in this area as I'm already working that waiting
for few things to go upstream first, got busy with testing and bug fixing
memleaks for next pull request ..

-ck



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC PATCH 00/11] Rust null block driver
  2023-05-04 20:22             ` Jens Axboe
@ 2023-05-05 10:53               ` Miguel Ojeda
  2023-05-05 12:24                 ` Boqun Feng
  2023-05-05 19:38                 ` Bart Van Assche
  0 siblings, 2 replies; 69+ messages in thread
From: Miguel Ojeda @ 2023-05-05 10:53 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Keith Busch, Bart Van Assche, Andreas Hindborg,
	Christoph Hellwig, Damien Le Moal, Hannes Reinecke, lsf-pc,
	rust-for-linux, linux-block, Matthew Wilcox, Miguel Ojeda,
	Alex Gaynor, Wedson Almeida Filho, Boqun Feng, Gary Guo,
	Björn Roy Baron, Benno Lossin, open list, gost.dev,
	Daniel Vetter

On Thu, May 4, 2023 at 10:22 PM Jens Axboe <axboe@kernel.dk> wrote:
>
> Right, but that doesn't really solve the problem when the rust bindings
> get in the way of changes that you are currently making. Or if you break
> them inadvertently. I do see benefits to that approach, but it's no
> panacea.
>
> This seems to assume that time is plentiful and we can just add more to
> our plate, which isn't always true. While I'd love to do more rust and
> get more familiar with it, the time still has to be there for that. I'm
> actually typing this on a laptop with a rust gpu driver :-)
>
> And this isn't just on me, there are other regular contributors and
> reviewers that would need to be onboard with this.

Indeed -- I didn't mean to imply it wouldn't be time consuming, only
that it might be an alternative approach compared to having existing
maintainers do it. Of course, it depends on the dynamics of the
subsystem, how busy the subsystem is, whether there is good rapport,
etc.

> Each case is different though, different people and different schedules
> and priorities. So while the above is promising, it's also just
> annecdotal and doesn't necessarily apply to our case.

Definitely, in the end subsystems know best if there is enough time
available (from everybody) to pull it off. I only meant to say that
the security angle is not the only benefit.

For instance, like you said, the error handling, plus a bunch more
that people usually enjoy: stricter typing, more information on
signatures, sum types, pattern matching, privacy, closures, generics,
etc.

Cheers,
Miguel

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC PATCH 00/11] Rust null block driver
  2023-05-04 20:55             ` Jens Axboe
  2023-05-05  5:06               ` Andreas Hindborg
@ 2023-05-05 11:14               ` Miguel Ojeda
  1 sibling, 0 replies; 69+ messages in thread
From: Miguel Ojeda @ 2023-05-05 11:14 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Andreas Hindborg, Keith Busch, Bart Van Assche,
	Christoph Hellwig, Damien Le Moal, Hannes Reinecke, lsf-pc,
	rust-for-linux, linux-block, Matthew Wilcox, Miguel Ojeda,
	Alex Gaynor, Wedson Almeida Filho, Boqun Feng, Gary Guo,
	Björn Roy Baron, Benno Lossin, open list, gost.dev

On Thu, May 4, 2023 at 10:55 PM Jens Axboe <axboe@kernel.dk> wrote:
>
> That potentially works for null_blk, but it would not work for anything
> that people actually depend on. In other words, anything that isn't
> null_blk. And I don't believe we'd be actively discussing these bindings
> if just doing null_blk is the end goal, because that isn't useful by
> itself, and at that point we'd all just be wasting our time. In the real
> world, once we have just one actual driver using it, then we'd be
> looking at "this driver regressed because of change X/Y/Z and that needs
> to get sorted before the next release". And THAT is the real issue for
> me. So a "rust keeps up or it breaks" model is a bit naive in my
> opinion, it's just not a viable approach. In fact, even for null_blk,
> this doesn't really fly as we rely on blktests to continually vet the
> sanity of the IO stack, and null_blk is an integral part of that.

But `null_blk` shouldn't be affected, no? The Rust abstractions can be
behind an explicit "experimental" / "broken" / "compile-test only"
gate (or similar) in the beginning, as a way to test how much
maintenance it actually requires.

In such a setup, Andreas could be the one responsible to keep them up
to date in the beginning. That is, in the worst case, a kernel release
could happen with the Rust side broken -- that way `null_blk` is not
impacted.

That is why a separate `MAINTAINERS` entry may be interesting if you
want to keep e.g. the `S:` level separate (though Andreas, I think,
may be able to do "Supported" at this time).

When the first real driver comes, a similar approach could be
repeated, to buy some more time too.

Cheers,
Miguel

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC PATCH 00/11] Rust null block driver
  2023-05-05 10:53               ` Miguel Ojeda
@ 2023-05-05 12:24                 ` Boqun Feng
  2023-05-05 13:52                   ` Boqun Feng
  2023-05-05 19:42                   ` Keith Busch
  2023-05-05 19:38                 ` Bart Van Assche
  1 sibling, 2 replies; 69+ messages in thread
From: Boqun Feng @ 2023-05-05 12:24 UTC (permalink / raw)
  To: Miguel Ojeda
  Cc: Jens Axboe, Keith Busch, Bart Van Assche, Andreas Hindborg,
	Christoph Hellwig, Damien Le Moal, Hannes Reinecke, lsf-pc,
	rust-for-linux, linux-block, Matthew Wilcox, Miguel Ojeda,
	Alex Gaynor, Wedson Almeida Filho, Gary Guo,
	Björn Roy Baron, Benno Lossin, open list, gost.dev,
	Daniel Vetter

On Fri, May 05, 2023 at 12:53:41PM +0200, Miguel Ojeda wrote:
> On Thu, May 4, 2023 at 10:22 PM Jens Axboe <axboe@kernel.dk> wrote:
> >
> > Right, but that doesn't really solve the problem when the rust bindings
> > get in the way of changes that you are currently making. Or if you break
> > them inadvertently. I do see benefits to that approach, but it's no
> > panacea.

One thing I want to point out is: not having a block layer Rust API
doesn't keep the block layer away from Rust ;-) Rust will get in the way
as long as block layer is used, directly or indirectly, in any Rust code
in kernel.

Take the M1 GPU driver for example, it can totally be done without a drm
Rust API: Lina will have to directly call C funciton in her GPU driver,
but it's possible or she can have her own drm Rust binding which is not
blessed by the drm maintainers. But as long as drm is used in a Rust
driver, a refactoring/improvement of drm will need to take the usage of
Rust side into consideration. Unless of course, some one is willing to
write a C driver for M1 GPU.

The Rust bindings are actually the way of communication between
subsystem mantainers and Rust driver writers, and can help reduce the
amount of work: You can have the abstraction the way you like.

Of course, there is always "don't do it until there are actually users",
and I totally agree with that. But what is a better way to design the
Rust binding for a subsystem?

*	Sit down and use the wisdom of maintainers and active
	developers, and really spend time on it right now? Or

*	Let one future user drag the API/binding design to insaneness?

I'd rather prefer the first approach. Time spent is time saved.

Personally, my biggest fear is: RCU stalls/lockdep warnings in the Rust
code (or they don't happen because incorrect bindings), and who is going
to fix them ;-) So I have to spend my time on making sure these bindings
in good shapes, which is not always a pleasant experience: the more you
use something, the more you hate it ;-) But I think it's worth.

Of course, by no means I want to force anyone to learn Rust, I totally
understand people who want to see zero Rust. Just want to say the
maintain burden may exist any way, and the Rust binding is actually the
thing to help here.

Regards,
Boqun

> >
> > This seems to assume that time is plentiful and we can just add more to
> > our plate, which isn't always true. While I'd love to do more rust and
> > get more familiar with it, the time still has to be there for that. I'm
> > actually typing this on a laptop with a rust gpu driver :-)
> >
> > And this isn't just on me, there are other regular contributors and
> > reviewers that would need to be onboard with this.
> 
> Indeed -- I didn't mean to imply it wouldn't be time consuming, only
> that it might be an alternative approach compared to having existing
> maintainers do it. Of course, it depends on the dynamics of the
> subsystem, how busy the subsystem is, whether there is good rapport,
> etc.
> 
> > Each case is different though, different people and different schedules
> > and priorities. So while the above is promising, it's also just
> > annecdotal and doesn't necessarily apply to our case.
> 
> Definitely, in the end subsystems know best if there is enough time
> available (from everybody) to pull it off. I only meant to say that
> the security angle is not the only benefit.
> 
> For instance, like you said, the error handling, plus a bunch more
> that people usually enjoy: stricter typing, more information on
> signatures, sum types, pattern matching, privacy, closures, generics,
> etc.
> 
> Cheers,
> Miguel

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC PATCH 00/11] Rust null block driver
  2023-05-05 12:24                 ` Boqun Feng
@ 2023-05-05 13:52                   ` Boqun Feng
  2023-05-05 19:42                   ` Keith Busch
  1 sibling, 0 replies; 69+ messages in thread
From: Boqun Feng @ 2023-05-05 13:52 UTC (permalink / raw)
  To: Miguel Ojeda
  Cc: Jens Axboe, Keith Busch, Bart Van Assche, Andreas Hindborg,
	Christoph Hellwig, Damien Le Moal, Hannes Reinecke, lsf-pc,
	rust-for-linux, linux-block, Matthew Wilcox, Miguel Ojeda,
	Alex Gaynor, Wedson Almeida Filho, Gary Guo,
	Björn Roy Baron, Benno Lossin, open list, gost.dev,
	Daniel Vetter

On Fri, May 05, 2023 at 05:24:56AM -0700, Boqun Feng wrote:
> On Fri, May 05, 2023 at 12:53:41PM +0200, Miguel Ojeda wrote:
> > On Thu, May 4, 2023 at 10:22 PM Jens Axboe <axboe@kernel.dk> wrote:
> > >
> > > Right, but that doesn't really solve the problem when the rust bindings
> > > get in the way of changes that you are currently making. Or if you break
> > > them inadvertently. I do see benefits to that approach, but it's no
> > > panacea.
> 
> One thing I want to point out is: not having a block layer Rust API
> doesn't keep the block layer away from Rust ;-) Rust will get in the way
> as long as block layer is used, directly or indirectly, in any Rust code
> in kernel.
> 
> Take the M1 GPU driver for example, it can totally be done without a drm
> Rust API: Lina will have to directly call C funciton in her GPU driver,
> but it's possible or she can have her own drm Rust binding which is not
> blessed by the drm maintainers. But as long as drm is used in a Rust
> driver, a refactoring/improvement of drm will need to take the usage of
> Rust side into consideration. Unless of course, some one is willing to
> write a C driver for M1 GPU.
> 
> The Rust bindings are actually the way of communication between
> subsystem mantainers and Rust driver writers, and can help reduce the
> amount of work: You can have the abstraction the way you like.
> 
> Of course, there is always "don't do it until there are actually users",
> and I totally agree with that. But what is a better way to design the
> Rust binding for a subsystem?
> 
> *	Sit down and use the wisdom of maintainers and active
> 	developers, and really spend time on it right now? Or
> 
> *	Let one future user drag the API/binding design to insaneness?
> 

Ah, of course, I should add: this is not the usual case, most of the
time, users (e.g. a real driver) can help the design, I was just trying
to say: without the wisdom of maintainers and active developers, a Rust
binding solely designed by one user could have some design issues. In
other words, the experience of maintaining C side API is very valuable
to design Rust bindings.

Regards,
Boqun

> I'd rather prefer the first approach. Time spent is time saved.
> 
> Personally, my biggest fear is: RCU stalls/lockdep warnings in the Rust
> code (or they don't happen because incorrect bindings), and who is going
> to fix them ;-) So I have to spend my time on making sure these bindings
> in good shapes, which is not always a pleasant experience: the more you
> use something, the more you hate it ;-) But I think it's worth.
> 
> Of course, by no means I want to force anyone to learn Rust, I totally
> understand people who want to see zero Rust. Just want to say the
> maintain burden may exist any way, and the Rust binding is actually the
> thing to help here.
> 
> Regards,
> Boqun
> 
> > >
> > > This seems to assume that time is plentiful and we can just add more to
> > > our plate, which isn't always true. While I'd love to do more rust and
> > > get more familiar with it, the time still has to be there for that. I'm
> > > actually typing this on a laptop with a rust gpu driver :-)
> > >
> > > And this isn't just on me, there are other regular contributors and
> > > reviewers that would need to be onboard with this.
> > 
> > Indeed -- I didn't mean to imply it wouldn't be time consuming, only
> > that it might be an alternative approach compared to having existing
> > maintainers do it. Of course, it depends on the dynamics of the
> > subsystem, how busy the subsystem is, whether there is good rapport,
> > etc.
> > 
> > > Each case is different though, different people and different schedules
> > > and priorities. So while the above is promising, it's also just
> > > annecdotal and doesn't necessarily apply to our case.
> > 
> > Definitely, in the end subsystems know best if there is enough time
> > available (from everybody) to pull it off. I only meant to say that
> > the security angle is not the only benefit.
> > 
> > For instance, like you said, the error handling, plus a bunch more
> > that people usually enjoy: stricter typing, more information on
> > signatures, sum types, pattern matching, privacy, closures, generics,
> > etc.
> > 
> > Cheers,
> > Miguel

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC PATCH 00/11] Rust null block driver
  2023-05-05 10:53               ` Miguel Ojeda
  2023-05-05 12:24                 ` Boqun Feng
@ 2023-05-05 19:38                 ` Bart Van Assche
  1 sibling, 0 replies; 69+ messages in thread
From: Bart Van Assche @ 2023-05-05 19:38 UTC (permalink / raw)
  To: Miguel Ojeda, Jens Axboe
  Cc: Keith Busch, Andreas Hindborg, Christoph Hellwig, Damien Le Moal,
	Hannes Reinecke, lsf-pc, rust-for-linux, linux-block,
	Matthew Wilcox, Miguel Ojeda, Alex Gaynor, Wedson Almeida Filho,
	Boqun Feng, Gary Guo, Björn Roy Baron, Benno Lossin,
	open list, gost.dev, Daniel Vetter

On 5/5/23 03:53, Miguel Ojeda wrote:
> Definitely, in the end subsystems know best if there is enough time
> available (from everybody) to pull it off. I only meant to say that
> the security angle is not the only benefit.
> 
> For instance, like you said, the error handling, plus a bunch more
> that people usually enjoy: stricter typing, more information on
> signatures, sum types, pattern matching, privacy, closures, generics,
> etc.

These are all great advantages of Rust.

One potential cause of memory corruption caused by block drivers is
misprogramming the DMA engine of the storage controller. This is something
no borrow checker can protect against. Only an IOMMU can protect against
the storage controller accessing memory that it shouldn't access. This is
not a criticism of Rust - I'm bringing this up because I think this is
something that is important to realize.

Bart.


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC PATCH 00/11] Rust null block driver
  2023-05-05 12:24                 ` Boqun Feng
  2023-05-05 13:52                   ` Boqun Feng
@ 2023-05-05 19:42                   ` Keith Busch
  2023-05-05 21:46                     ` Boqun Feng
  1 sibling, 1 reply; 69+ messages in thread
From: Keith Busch @ 2023-05-05 19:42 UTC (permalink / raw)
  To: Boqun Feng
  Cc: Miguel Ojeda, Jens Axboe, Bart Van Assche, Andreas Hindborg,
	Christoph Hellwig, Damien Le Moal, Hannes Reinecke, lsf-pc,
	rust-for-linux, linux-block, Matthew Wilcox, Miguel Ojeda,
	Alex Gaynor, Wedson Almeida Filho, Gary Guo,
	Björn Roy Baron, Benno Lossin, open list, gost.dev,
	Daniel Vetter

On Fri, May 05, 2023 at 05:24:56AM -0700, Boqun Feng wrote:
> 
> The Rust bindings are actually the way of communication between
> subsystem mantainers and Rust driver writers, and can help reduce the
> amount of work: You can have the abstraction the way you like.

We don't have stable APIs or structures here, so let's be clear-eyed
about the maintenance burden these bindings create for linux-block
contributors. Not a hard "no" from me, but this isn't something to
handwave over.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC PATCH 00/11] Rust null block driver
  2023-05-05 19:42                   ` Keith Busch
@ 2023-05-05 21:46                     ` Boqun Feng
  0 siblings, 0 replies; 69+ messages in thread
From: Boqun Feng @ 2023-05-05 21:46 UTC (permalink / raw)
  To: Keith Busch
  Cc: Miguel Ojeda, Jens Axboe, Bart Van Assche, Andreas Hindborg,
	Christoph Hellwig, Damien Le Moal, Hannes Reinecke, lsf-pc,
	rust-for-linux, linux-block, Matthew Wilcox, Miguel Ojeda,
	Alex Gaynor, Wedson Almeida Filho, Gary Guo,
	Björn Roy Baron, Benno Lossin, open list, gost.dev,
	Daniel Vetter

On Fri, May 05, 2023 at 01:42:42PM -0600, Keith Busch wrote:
> On Fri, May 05, 2023 at 05:24:56AM -0700, Boqun Feng wrote:
> > 
> > The Rust bindings are actually the way of communication between
> > subsystem mantainers and Rust driver writers, and can help reduce the
> > amount of work: You can have the abstraction the way you like.
> 
> We don't have stable APIs or structures here, so let's be clear-eyed

Of course, but every API change need to cover all in-tree users, right?

> about the maintenance burden these bindings create for linux-block
> contributors. Not a hard "no" from me, but this isn't something to
> handwave over.

Not tried to handwave over anything ;-) The fact IIUC is simply that Rust
drivers can call C function, so say a in-tree Rust driver does something
as follow:

	struct Foo {
		ptr: *mut bio; // A pointer to bio.
		...
	}

	impl Foo {
		pub fn bar(&self) {
			unsafe {
				// calling a C function "do_sth_to_bio".
				// void do_sth_to_bio(struct bio *bio);
				bindings::do_sth_to_bio(self.ptr);
			}
		}
	}

That's an user of the block layer, and that user could exist even
without Bio abstraction. And whenever a linux-block developer wants to
refactor the "do_sth_to_bio" with a slightly different semantics, that
user will need to be taken into consideration, meaning reading the Rust
code of Foo to understand the usage.

Now with a Bio abstraction, the immediate effect would be there should
be no Rust code is allowed to directly calls block layer functions
without using Bio abstraction. And hopefully Bio abstraction along with
other bindings is a place good enough to reasoning about semanitcs
changes or refactoring, so no need to read the code of Foo to understand
the usage. Of course, some C side changes may result into changes in
Rust bindings as well, but that doesn't make things worse. (Need to
understand Foo in that case if there is no Rust bindings).

Of course, this is just my 2 cents. I could be completely wrong.
(Put the Rust-for-Linux hat on) Needless to say with or without the Rust
bindings for the block layer, we are (at least I'm) happy to help on any
Rust related questions/bugs/issues for linux-block ;-)

Regards,
Boqun

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC PATCH 00/11] Rust null block driver
  2023-05-03  9:06 [RFC PATCH 00/11] Rust null block driver Andreas Hindborg
                   ` (12 preceding siblings ...)
  2023-05-03 16:47 ` Bart Van Assche
@ 2023-05-07 23:31 ` Luis Chamberlain
  2023-05-07 23:37   ` Andreas Hindborg
  2023-07-27  3:45 ` Yexuan Yang
                   ` (2 subsequent siblings)
  16 siblings, 1 reply; 69+ messages in thread
From: Luis Chamberlain @ 2023-05-07 23:31 UTC (permalink / raw)
  To: Andreas Hindborg
  Cc: Jens Axboe, Christoph Hellwig, Keith Busch, Damien Le Moal,
	Hannes Reinecke, lsf-pc, rust-for-linux, linux-block,
	Andreas Hindborg, Matthew Wilcox, Miguel Ojeda, Alex Gaynor,
	Wedson Almeida Filho, Boqun Feng, Gary Guo, Björn Roy Baron,
	Benno Lossin, open list, gost.dev

On Wed, May 03, 2023 at 11:06:57AM +0200, Andreas Hindborg wrote:
> The statistics presented in my previous message [1] show that the C null block
> driver has had a significant amount of memory safety related problems in the
> past. 41% of fixes merged for the C null block driver are fixes for memory
> safety issues. This makes the null block driver a good candidate for rewriting
> in Rust.

Curious, how long does it take to do an analysis like this? Are there efforts
to automate this a bit more? We have efforts to use machine learning to
evaluate stable candidate patches, we probably should be able to qualify
commits as fixing "memory safety", I figure.

Because what I'd love to see is if we can could easily obtain similar
statistics for arbitrary parts of the kernel. The easiest way to break
this down might be by kconfig symbol for instance, and then based on
that gather more information about subsystems.

Then the rationale for considerating adopting rust bindings for certain areas
of the kernel becomes a bit clearer.

I figured some of this work has already been done, but I just haven't seen it
yet.

  Luis

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC PATCH 00/11] Rust null block driver
  2023-05-07 23:31 ` Luis Chamberlain
@ 2023-05-07 23:37   ` Andreas Hindborg
  0 siblings, 0 replies; 69+ messages in thread
From: Andreas Hindborg @ 2023-05-07 23:37 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Jens Axboe, Christoph Hellwig, Keith Busch, Damien Le Moal,
	Hannes Reinecke, lsf-pc, rust-for-linux, linux-block,
	Matthew Wilcox, Miguel Ojeda, Alex Gaynor, Wedson Almeida Filho,
	Boqun Feng, Gary Guo, Björn Roy Baron, Benno Lossin,
	open list, gost.dev


Luis Chamberlain <mcgrof@kernel.org> writes:

> On Wed, May 03, 2023 at 11:06:57AM +0200, Andreas Hindborg wrote:
>> The statistics presented in my previous message [1] show that the C null block
>> driver has had a significant amount of memory safety related problems in the
>> past. 41% of fixes merged for the C null block driver are fixes for memory
>> safety issues. This makes the null block driver a good candidate for rewriting
>> in Rust.
>
> Curious, how long does it take to do an analysis like this? Are there efforts
> to automate this a bit more? We have efforts to use machine learning to
> evaluate stable candidate patches, we probably should be able to qualify
> commits as fixing "memory safety", I figure.
>
> Because what I'd love to see is if we can could easily obtain similar
> statistics for arbitrary parts of the kernel. The easiest way to break
> this down might be by kconfig symbol for instance, and then based on
> that gather more information about subsystems.
>

I spent around 4 hours with a spreadsheet and git. It would be cool if
that work could be automated. It's not always clear from the commit
heading that a commit is a fix. When it is clear that it is a fix, it
might not be clear what is fixed. I had to look at the diff quite a few
commits.

There is some work mentioning the ratio of memory safety issues fixed in
the kernel, but none of them go into details for specific subsystems as
far as I know. 20% of bugs fixed in stable Linux Kernel branches for
drivers are memory safety issues [1]. 65% of recent Linux kernel
vulnerabilities are memory safety issues [2]

> Then the rationale for considerating adopting rust bindings for certain areas
> of the kernel becomes a bit clearer.

As mentioned elsewhere in this thread there are other benefits from
deploying Rust than provable absence of memory safety issues.

Best regards
Andreas

[1] http://dx.doi.org/10.15514/ISPRAS-2018-30(6)-8
[2] https://lssna19.sched.com/event/RHaT/writing-linux-kernel-modules-in-safe-rust-geoffrey-thomas-two-sigma-investments-alex-gaynor-alloy

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC PATCH 03/11] rust: block: introduce `kernel::block::mq` module
  2023-05-03  9:07 ` [RFC PATCH 03/11] rust: block: introduce `kernel::block::mq` module Andreas Hindborg
@ 2023-05-08 12:29   ` Benno Lossin
  2023-05-11  6:52     ` Sergio González Collado
  2024-01-12  9:18     ` Andreas Hindborg (Samsung)
  0 siblings, 2 replies; 69+ messages in thread
From: Benno Lossin @ 2023-05-08 12:29 UTC (permalink / raw)
  To: Andreas Hindborg
  Cc: Jens Axboe, Christoph Hellwig, Keith Busch, Damien Le Moal,
	Hannes Reinecke, lsf-pc, rust-for-linux, linux-block,
	Andreas Hindborg, Matthew Wilcox, Miguel Ojeda, Alex Gaynor,
	Wedson Almeida Filho, Boqun Feng, Gary Guo, Björn Roy Baron,
	linux-kernel, gost.dev

I have left some comments below. Some of them are not really
suggestions, but rather I would like to know the rationale of the
design, as I am not familiar at all with the C side and have mostly
no idea what the called functions do.

On Wednesday, May 3rd, 2023 at 11:07, Andreas Hindborg <nmi@metaspace.dk> wrote:
> From: Andreas Hindborg <a.hindborg@samsung.com>
> 
> Add initial abstractions for working with blk-mq.
> 
> This patch is a maintained, refactored subset of code originally published by
> Wedson Almeida Filho <wedsonaf@gmail.com> [1].
> 
> [1] https://github.com/wedsonaf/linux/tree/f2cfd2fe0e2ca4e90994f96afe268bbd4382a891/rust/kernel/blk/mq.rs
> 
> Cc: Wedson Almeida Filho <wedsonaf@gmail.com>
> Signed-off-by: Andreas Hindborg <a.hindborg@samsung.com>
> ---
>  rust/bindings/bindings_helper.h    |   2 +
>  rust/helpers.c                     |  22 +++
>  rust/kernel/block.rs               |   5 +
>  rust/kernel/block/mq.rs            |  15 ++
>  rust/kernel/block/mq/gen_disk.rs   | 133 +++++++++++++++
>  rust/kernel/block/mq/operations.rs | 260 +++++++++++++++++++++++++++++
>  rust/kernel/block/mq/raw_writer.rs |  30 ++++
>  rust/kernel/block/mq/request.rs    |  71 ++++++++
>  rust/kernel/block/mq/tag_set.rs    |  92 ++++++++++
>  rust/kernel/error.rs               |   4 +
>  rust/kernel/lib.rs                 |   1 +
>  11 files changed, 635 insertions(+)
>  create mode 100644 rust/kernel/block.rs
>  create mode 100644 rust/kernel/block/mq.rs
>  create mode 100644 rust/kernel/block/mq/gen_disk.rs
>  create mode 100644 rust/kernel/block/mq/operations.rs
>  create mode 100644 rust/kernel/block/mq/raw_writer.rs
>  create mode 100644 rust/kernel/block/mq/request.rs
>  create mode 100644 rust/kernel/block/mq/tag_set.rs
> 
> diff --git a/rust/bindings/bindings_helper.h b/rust/bindings/bindings_helper.h
> index 52834962b94d..86c07eeb1ba1 100644
> --- a/rust/bindings/bindings_helper.h
> +++ b/rust/bindings/bindings_helper.h
> @@ -11,6 +11,8 @@
>  #include <linux/wait.h>
>  #include <linux/sched.h>
>  #include <linux/radix-tree.h>
> +#include <linux/blk_types.h>
> +#include <linux/blk-mq.h>
> 
>  /* `bindgen` gets confused at certain things. */
>  const gfp_t BINDINGS_GFP_KERNEL = GFP_KERNEL;
> diff --git a/rust/helpers.c b/rust/helpers.c
> index 9bd9d95da951..a59341084774 100644
> --- a/rust/helpers.c
> +++ b/rust/helpers.c
> @@ -18,6 +18,7 @@
>   * accidentally exposed.
>   */
> 
> +#include <linux/bio.h>
>  #include <linux/bug.h>
>  #include <linux/build_bug.h>
>  #include <linux/err.h>
> @@ -28,6 +29,8 @@
>  #include <linux/wait.h>
>  #include <linux/radix-tree.h>
>  #include <linux/highmem.h>
> +#include <linux/blk-mq.h>
> +#include <linux/blkdev.h>
> 
>  __noreturn void rust_helper_BUG(void)
>  {
> @@ -130,6 +133,25 @@ void rust_helper_put_task_struct(struct task_struct *t)
>  }
>  EXPORT_SYMBOL_GPL(rust_helper_put_task_struct);
> 
> +struct bio_vec rust_helper_req_bvec(struct request *rq)
> +{
> +	return req_bvec(rq);
> +}
> +EXPORT_SYMBOL_GPL(rust_helper_req_bvec);
> +
> +void *rust_helper_blk_mq_rq_to_pdu(struct request *rq)
> +{
> +	return blk_mq_rq_to_pdu(rq);
> +}
> +EXPORT_SYMBOL_GPL(rust_helper_blk_mq_rq_to_pdu);
> +
> +void rust_helper_bio_advance_iter_single(const struct bio *bio,
> +                                         struct bvec_iter *iter,
> +                                         unsigned int bytes) {
> +  bio_advance_iter_single(bio, iter, bytes);
> +}
> +EXPORT_SYMBOL_GPL(rust_helper_bio_advance_iter_single);
> +
>  void rust_helper_init_radix_tree(struct xarray *tree, gfp_t gfp_mask)
>  {
>  	INIT_RADIX_TREE(tree, gfp_mask);
> diff --git a/rust/kernel/block.rs b/rust/kernel/block.rs
> new file mode 100644
> index 000000000000..4c93317a568a
> --- /dev/null
> +++ b/rust/kernel/block.rs
> @@ -0,0 +1,5 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +//! Types for working with the block layer
> +
> +pub mod mq;
> diff --git a/rust/kernel/block/mq.rs b/rust/kernel/block/mq.rs
> new file mode 100644
> index 000000000000..5b40f6a73c0f
> --- /dev/null
> +++ b/rust/kernel/block/mq.rs
> @@ -0,0 +1,15 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +//! This module provides types for implementing drivers that interface the
> +//! blk-mq subsystem
> +
> +mod gen_disk;
> +mod operations;
> +mod raw_writer;
> +mod request;
> +mod tag_set;
> +
> +pub use gen_disk::GenDisk;
> +pub use operations::Operations;
> +pub use request::Request;
> +pub use tag_set::TagSet;
> diff --git a/rust/kernel/block/mq/gen_disk.rs b/rust/kernel/block/mq/gen_disk.rs
> new file mode 100644
> index 000000000000..50496af15bbf
> --- /dev/null
> +++ b/rust/kernel/block/mq/gen_disk.rs
> @@ -0,0 +1,133 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +//! GenDisk abstraction
> +//!
> +//! C header: [`include/linux/blkdev.h`](../../include/linux/blkdev.h)
> +//! C header: [`include/linux/blk_mq.h`](../../include/linux/blk_mq.h)
> +
> +use crate::block::mq::{raw_writer::RawWriter, Operations, TagSet};
> +use crate::{
> +    bindings, error::from_err_ptr, error::Result, sync::Arc, types::ForeignOwnable,
> +    types::ScopeGuard,
> +};
> +use core::fmt::{self, Write};
> +
> +/// A generic block device
> +///
> +/// # Invariants
> +///
> +///  - `gendisk` must always point to an initialized and valid `struct gendisk`.
> +pub struct GenDisk<T: Operations> {
> +    _tagset: Arc<TagSet<T>>,
> +    gendisk: *mut bindings::gendisk,

Why are these two fields not embedded? Shouldn't the user decide where
to allocate?

> +}
> +
> +// SAFETY: `GenDisk` is an owned pointer to a `struct gendisk` and an `Arc` to a
> +// `TagSet` It is safe to send this to other threads as long as T is Send.
> +unsafe impl<T: Operations + Send> Send for GenDisk<T> {}
> +
> +impl<T: Operations> GenDisk<T> {
> +    /// Try to create a new `GenDisk`
> +    pub fn try_new(tagset: Arc<TagSet<T>>, queue_data: T::QueueData) -> Result<Self> {
> +        let data = queue_data.into_foreign();
> +        let recover_data = ScopeGuard::new(|| {
> +            // SAFETY: T::QueueData was created by the call to `into_foreign()` above
> +            unsafe { T::QueueData::from_foreign(data) };
> +        });
> +
> +        let lock_class_key = crate::sync::LockClassKey::new();
> +
> +        // SAFETY: `tagset.raw_tag_set()` points to a valid and initialized tag set
> +        let gendisk = from_err_ptr(unsafe {
> +            bindings::__blk_mq_alloc_disk(tagset.raw_tag_set(), data as _, lock_class_key.as_ptr())

Avoid `as _` casts.

> +        })?;
> +
> +        const TABLE: bindings::block_device_operations = bindings::block_device_operations {
> +            submit_bio: None,
> +            open: None,
> +            release: None,
> +            ioctl: None,
> +            compat_ioctl: None,
> +            check_events: None,
> +            unlock_native_capacity: None,
> +            getgeo: None,
> +            set_read_only: None,
> +            swap_slot_free_notify: None,
> +            report_zones: None,
> +            devnode: None,
> +            alternative_gpt_sector: None,
> +            get_unique_id: None,
> +            owner: core::ptr::null_mut(),
> +            pr_ops: core::ptr::null_mut(),
> +            free_disk: None,
> +            poll_bio: None,
> +        };
> +
> +        // SAFETY: gendisk is a valid pointer as we initialized it above
> +        unsafe { (*gendisk).fops = &TABLE };
> +
> +        recover_data.dismiss();
> +        Ok(Self {
> +            _tagset: tagset,
> +            gendisk,
> +        })
> +    }
> +
> +    /// Set the name of the device
> +    pub fn set_name(&self, args: fmt::Arguments<'_>) -> Result {
> +        let mut raw_writer = RawWriter::from_array(unsafe { &mut (*self.gendisk).disk_name });

Missing `SAFETY` also see below.

> +        raw_writer.write_fmt(args)?;
> +        raw_writer.write_char('\0')?;
> +        Ok(())
> +    }
> +
> +    /// Register the device with the kernel. When this function return, the
> +    /// device is accessible from VFS. The kernel may issue reads to the device
> +    /// during registration to discover partition infomation.
> +    pub fn add(&self) -> Result {
> +        crate::error::to_result(unsafe {
> +            bindings::device_add_disk(core::ptr::null_mut(), self.gendisk, core::ptr::null_mut())
> +        })
> +    }
> +
> +    /// Call to tell the block layer the capcacity of the device
> +    pub fn set_capacity(&self, sectors: u64) {
> +        unsafe { bindings::set_capacity(self.gendisk, sectors) };
> +    }
> +
> +    /// Set the logical block size of the device
> +    pub fn set_queue_logical_block_size(&self, size: u32) {
> +        unsafe { bindings::blk_queue_logical_block_size((*self.gendisk).queue, size) };
> +    }
> +
> +    /// Set the physical block size of the device

What does this *do*? I do not think the doc string gives any meaningful
information not present in the function name (this might just be,
because I have no idea of what this is and anyone with just a little
more knowledge would know, but I still wanted to mention it).

> +    pub fn set_queue_physical_block_size(&self, size: u32) {
> +        unsafe { bindings::blk_queue_physical_block_size((*self.gendisk).queue, size) };
> +    }
> +
> +    /// Set the rotational media attribute for the device
> +    pub fn set_rotational(&self, rotational: bool) {
> +        if !rotational {
> +            unsafe {
> +                bindings::blk_queue_flag_set(bindings::QUEUE_FLAG_NONROT, (*self.gendisk).queue)
> +            };
> +        } else {
> +            unsafe {
> +                bindings::blk_queue_flag_clear(bindings::QUEUE_FLAG_NONROT, (*self.gendisk).queue)
> +            };
> +        }
> +    }
> +}
> +
> +impl<T: Operations> Drop for GenDisk<T> {
> +    fn drop(&mut self) {
> +        let queue_data = unsafe { (*(*self.gendisk).queue).queuedata };
> +
> +        unsafe { bindings::del_gendisk(self.gendisk) };
> +
> +        // SAFETY: `queue.queuedata` was created by `GenDisk::try_new()` with a
> +        // call to `ForeignOwnable::into_pointer()` to create `queuedata`.
> +        // `ForeignOwnable::from_foreign()` is only called here.
> +        let _queue_data = unsafe { T::QueueData::from_foreign(queue_data) };
> +    }
> +}
> diff --git a/rust/kernel/block/mq/operations.rs b/rust/kernel/block/mq/operations.rs
> new file mode 100644
> index 000000000000..fb1ab707d1f0
> --- /dev/null
> +++ b/rust/kernel/block/mq/operations.rs
> @@ -0,0 +1,260 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +//! This module provides an interface for blk-mq drivers to implement.
> +//!
> +//! C header: [`include/linux/blk-mq.h`](../../include/linux/blk-mq.h)
> +
> +use crate::{
> +    bindings,
> +    block::mq::{tag_set::TagSetRef, Request},
> +    error::{from_result, Result},
> +    types::ForeignOwnable,
> +};
> +use core::{marker::PhantomData, pin::Pin};
> +
> +/// Implement this trait to interface blk-mq as block devices
> +#[macros::vtable]
> +pub trait Operations: Sized {

Is this trait really safe? Are there **no** requirements for e.g.
`QueueData`? So could I use `Box<()>`?

> +    /// Data associated with a request. This data is located next to the request
> +    /// structure.
> +    type RequestData;
> +
> +    /// Data associated with the `struct request_queue` that is allocated for
> +    /// the `GenDisk` associated with this `Operations` implementation.
> +    type QueueData: ForeignOwnable;
> +
> +    /// Data associated with a dispatch queue. This is stored as a pointer in
> +    /// `struct blk_mq_hw_ctx`.
> +    type HwData: ForeignOwnable;
> +
> +    /// Data associated with a tag set. This is stored as a pointer in `struct
> +    /// blk_mq_tag_set`.
> +    type TagSetData: ForeignOwnable;
> +
> +    /// Called by the kernel to allocate a new `RequestData`. The structure will
> +    /// eventually be pinned, so defer initialization to `init_request_data()`
> +    fn new_request_data(
> +        _tagset_data: <Self::TagSetData as ForeignOwnable>::Borrowed<'_>,
> +    ) -> Result<Self::RequestData>;
> +
> +    /// Called by the kernel to initialize a previously allocated `RequestData`
> +    fn init_request_data(
> +        _tagset_data: <Self::TagSetData as ForeignOwnable>::Borrowed<'_>,
> +        _data: Pin<&mut Self::RequestData>,
> +    ) -> Result {
> +        Ok(())
> +    }
> +
> +    /// Called by the kernel to queue a request with the driver. If `is_last` is
> +    /// `false`, the driver is allowed to defer commiting the request.
> +    fn queue_rq(
> +        hw_data: <Self::HwData as ForeignOwnable>::Borrowed<'_>,
> +        queue_data: <Self::QueueData as ForeignOwnable>::Borrowed<'_>,
> +        rq: &Request<Self>,
> +        is_last: bool,
> +    ) -> Result;
> +
> +    /// Called by the kernel to indicate that queued requests should be submitted
> +    fn commit_rqs(
> +        hw_data: <Self::HwData as ForeignOwnable>::Borrowed<'_>,
> +        queue_data: <Self::QueueData as ForeignOwnable>::Borrowed<'_>,
> +    );
> +
> +    /// Called by the kernel when the request is completed
> +    fn complete(_rq: &Request<Self>);
> +
> +    /// Called by the kernel to allocate and initialize a driver specific hardware context data
> +    fn init_hctx(
> +        tagset_data: <Self::TagSetData as ForeignOwnable>::Borrowed<'_>,
> +        hctx_idx: u32,
> +    ) -> Result<Self::HwData>;
> +
> +    /// Called by the kernel to poll the device for completed requests. Only used for poll queues.
> +    fn poll(_hw_data: <Self::HwData as ForeignOwnable>::Borrowed<'_>) -> i32 {
> +        unreachable!()

Why are these implemented this way? Should this really panic? Maybe
return an error? Why `i32` as the return type? If it can error it should
be `Result<u32>`.

> +    }
> +
> +    /// Called by the kernel to map submission queues to CPU cores.
> +    fn map_queues(_tag_set: &TagSetRef) {
> +        unreachable!()
> +    }
> +
> +    // There is no need for exit_request() because `drop` will be called.
> +}
> +
> +pub(crate) struct OperationsVtable<T: Operations>(PhantomData<T>);
> +
> +impl<T: Operations> OperationsVtable<T> {
> +    // # Safety
> +    //
> +    // The caller of this function must ensure that `hctx` and `bd` are valid
> +    // and initialized. The pointees must outlive this function. Further
> +    // `hctx->driver_data` must be a pointer created by a call to
> +    // `Self::init_hctx_callback()` and the pointee must outlive this function.
> +    // This function must not be called with a `hctx` for which
> +    // `Self::exit_hctx_callback()` has been called.
> +    unsafe extern "C" fn queue_rq_callback(
> +        hctx: *mut bindings::blk_mq_hw_ctx,
> +        bd: *const bindings::blk_mq_queue_data,
> +    ) -> bindings::blk_status_t {
> +        // SAFETY: `bd` is valid as required by the safety requirement for this function.
> +        let rq = unsafe { (*bd).rq };
> +
> +        // SAFETY: The safety requirement for this function ensure that
> +        // `(*hctx).driver_data` was returned by a call to
> +        // `Self::init_hctx_callback()`. That function uses
> +        // `PointerWrapper::into_pointer()` to create `driver_data`. Further,
> +        // the returned value does not outlive this function and
> +        // `from_foreign()` is not called until `Self::exit_hctx_callback()` is
> +        // called. By the safety requirement of this function and contract with
> +        // the `blk-mq` API, `queue_rq_callback()` will not be called after that
> +        // point.

This safety section and the others here are rather long and mostly
repeat themselves. Is it possible to put this in its own module and
explain the safety invariants in that module and then in these safety
sections just refer to some labels from that section?

I think we should discuss this in our next meeting.

> +        let hw_data = unsafe { T::HwData::borrow((*hctx).driver_data) };
> +
> +        // SAFETY: `hctx` is valid as required by this function.
> +        let queue_data = unsafe { (*(*hctx).queue).queuedata };
> +
> +        // SAFETY: `queue.queuedata` was created by `GenDisk::try_new()` with a
> +        // call to `ForeignOwnable::into_pointer()` to create `queuedata`.
> +        // `ForeignOwnable::from_foreign()` is only called when the tagset is
> +        // dropped, which happens after we are dropped.
> +        let queue_data = unsafe { T::QueueData::borrow(queue_data) };
> +
> +        // SAFETY: `bd` is valid as required by the safety requirement for this function.
> +        let ret = T::queue_rq(
> +            hw_data,
> +            queue_data,
> +            &unsafe { Request::from_ptr(rq) },
> +            unsafe { (*bd).last },
> +        );
> +        if let Err(e) = ret {
> +            e.to_blk_status()
> +        } else {
> +            bindings::BLK_STS_OK as _
> +        }
> +    }
> +
> +    unsafe extern "C" fn commit_rqs_callback(hctx: *mut bindings::blk_mq_hw_ctx) {
> +        let hw_data = unsafe { T::HwData::borrow((*hctx).driver_data) };
> +
> +        // SAFETY: `hctx` is valid as required by this function.
> +        let queue_data = unsafe { (*(*hctx).queue).queuedata };
> +
> +        // SAFETY: `queue.queuedata` was created by `GenDisk::try_new()` with a
> +        // call to `ForeignOwnable::into_pointer()` to create `queuedata`.
> +        // `ForeignOwnable::from_foreign()` is only called when the tagset is
> +        // dropped, which happens after we are dropped.
> +        let queue_data = unsafe { T::QueueData::borrow(queue_data) };
> +        T::commit_rqs(hw_data, queue_data)
> +    }
> +
> +    unsafe extern "C" fn complete_callback(rq: *mut bindings::request) {
> +        T::complete(&unsafe { Request::from_ptr(rq) });
> +    }
> +
> +    unsafe extern "C" fn poll_callback(
> +        hctx: *mut bindings::blk_mq_hw_ctx,
> +        _iob: *mut bindings::io_comp_batch,
> +    ) -> core::ffi::c_int {
> +        let hw_data = unsafe { T::HwData::borrow((*hctx).driver_data) };
> +        T::poll(hw_data)
> +    }
> +
> +    unsafe extern "C" fn init_hctx_callback(
> +        hctx: *mut bindings::blk_mq_hw_ctx,
> +        tagset_data: *mut core::ffi::c_void,
> +        hctx_idx: core::ffi::c_uint,
> +    ) -> core::ffi::c_int {
> +        from_result(|| {
> +            let tagset_data = unsafe { T::TagSetData::borrow(tagset_data) };
> +            let data = T::init_hctx(tagset_data, hctx_idx)?;
> +            unsafe { (*hctx).driver_data = data.into_foreign() as _ };
> +            Ok(0)
> +        })
> +    }
> +
> +    unsafe extern "C" fn exit_hctx_callback(
> +        hctx: *mut bindings::blk_mq_hw_ctx,
> +        _hctx_idx: core::ffi::c_uint,
> +    ) {
> +        let ptr = unsafe { (*hctx).driver_data };
> +        unsafe { T::HwData::from_foreign(ptr) };
> +    }
> +
> +    unsafe extern "C" fn init_request_callback(
> +        set: *mut bindings::blk_mq_tag_set,
> +        rq: *mut bindings::request,
> +        _hctx_idx: core::ffi::c_uint,
> +        _numa_node: core::ffi::c_uint,
> +    ) -> core::ffi::c_int {
> +        from_result(|| {
> +            // SAFETY: The tagset invariants guarantee that all requests are allocated with extra memory
> +            // for the request data.
> +            let pdu = unsafe { bindings::blk_mq_rq_to_pdu(rq) } as *mut T::RequestData;
> +            let tagset_data = unsafe { T::TagSetData::borrow((*set).driver_data) };
> +
> +            let v = T::new_request_data(tagset_data)?;
> +
> +            // SAFETY: `pdu` memory is valid, as it was allocated by the caller.
> +            unsafe { pdu.write(v) };
> +
> +            let tagset_data = unsafe { T::TagSetData::borrow((*set).driver_data) };
> +            // SAFETY: `pdu` memory is valid and properly initialised.
> +            T::init_request_data(tagset_data, unsafe { Pin::new_unchecked(&mut *pdu) })?;
> +
> +            Ok(0)
> +        })
> +    }
> +
> +    unsafe extern "C" fn exit_request_callback(
> +        _set: *mut bindings::blk_mq_tag_set,
> +        rq: *mut bindings::request,
> +        _hctx_idx: core::ffi::c_uint,
> +    ) {
> +        // SAFETY: The tagset invariants guarantee that all requests are allocated with extra memory
> +        // for the request data.
> +        let pdu = unsafe { bindings::blk_mq_rq_to_pdu(rq) } as *mut T::RequestData;
> +
> +        // SAFETY: `pdu` is valid for read and write and is properly initialised.
> +        unsafe { core::ptr::drop_in_place(pdu) };
> +    }
> +
> +    unsafe extern "C" fn map_queues_callback(tag_set_ptr: *mut bindings::blk_mq_tag_set) {
> +        let tag_set = unsafe { TagSetRef::from_ptr(tag_set_ptr) };
> +        T::map_queues(&tag_set);
> +    }
> +
> +    const VTABLE: bindings::blk_mq_ops = bindings::blk_mq_ops {
> +        queue_rq: Some(Self::queue_rq_callback),
> +        queue_rqs: None,
> +        commit_rqs: Some(Self::commit_rqs_callback),
> +        get_budget: None,
> +        put_budget: None,
> +        set_rq_budget_token: None,
> +        get_rq_budget_token: None,
> +        timeout: None,
> +        poll: if T::HAS_POLL {
> +            Some(Self::poll_callback)
> +        } else {
> +            None
> +        },
> +        complete: Some(Self::complete_callback),
> +        init_hctx: Some(Self::init_hctx_callback),
> +        exit_hctx: Some(Self::exit_hctx_callback),
> +        init_request: Some(Self::init_request_callback),
> +        exit_request: Some(Self::exit_request_callback),
> +        cleanup_rq: None,
> +        busy: None,
> +        map_queues: if T::HAS_MAP_QUEUES {
> +            Some(Self::map_queues_callback)
> +        } else {
> +            None
> +        },
> +        #[cfg(CONFIG_BLK_DEBUG_FS)]
> +        show_rq: None,
> +    };
> +
> +    pub(crate) const unsafe fn build() -> &'static bindings::blk_mq_ops {
> +        &Self::VTABLE
> +    }

Why is this function `unsafe`?

> +}

Some `# Safety` and `SAFETY` missing in this hunk.

> diff --git a/rust/kernel/block/mq/raw_writer.rs b/rust/kernel/block/mq/raw_writer.rs
> new file mode 100644
> index 000000000000..25c16ee0b1f7
> --- /dev/null
> +++ b/rust/kernel/block/mq/raw_writer.rs
> @@ -0,0 +1,30 @@
> +use core::fmt::{self, Write};
> +
> +pub(crate) struct RawWriter {
> +    ptr: *mut u8,
> +    len: usize,
> +}
> +
> +impl RawWriter {
> +    unsafe fn new(ptr: *mut u8, len: usize) -> Self {
> +        Self { ptr, len }
> +    }
> +
> +    pub(crate) fn from_array<const N: usize>(a: &mut [core::ffi::c_char; N]) -> Self {
> +        unsafe { Self::new(&mut a[0] as *mut _ as _, N) }
> +    }

This function needs to be `unsafe`, because it never captures the
lifetime of `a`. I can write:
    let mut a = Box::new([0; 10]);
    let mut writer = RawWriter::from_array(&mut *a);
    drop(a);
    writer.write_str("Abc"); // UAF
Alternatively add a lifetime to `RawWriter`.

> +}
> +
> +impl Write for RawWriter {
> +    fn write_str(&mut self, s: &str) -> fmt::Result {
> +        let bytes = s.as_bytes();
> +        let len = bytes.len();
> +        if len > self.len {
> +            return Err(fmt::Error);
> +        }
> +        unsafe { core::ptr::copy_nonoverlapping(&bytes[0], self.ptr, len) };
> +        self.ptr = unsafe { self.ptr.add(len) };
> +        self.len -= len;
> +        Ok(())
> +    }
> +}
> diff --git a/rust/kernel/block/mq/request.rs b/rust/kernel/block/mq/request.rs
> new file mode 100644
> index 000000000000..e95ae3fd71ad
> --- /dev/null
> +++ b/rust/kernel/block/mq/request.rs
> @@ -0,0 +1,71 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +//! This module provides a wrapper for the C `struct request` type.
> +//!
> +//! C header: [`include/linux/blk-mq.h`](../../include/linux/blk-mq.h)
> +
> +use crate::{
> +    bindings,
> +    block::mq::Operations,
> +    error::{Error, Result},
> +};
> +use core::marker::PhantomData;
> +
> +/// A wrapper around a blk-mq `struct request`. This represents an IO request.
> +pub struct Request<T: Operations> {
> +    ptr: *mut bindings::request,

Why is this not embedded?

> +    _p: PhantomData<T>,
> +}
> +
> +impl<T: Operations> Request<T> {
> +    pub(crate) unsafe fn from_ptr(ptr: *mut bindings::request) -> Self {
> +        Self {
> +            ptr,
> +            _p: PhantomData,
> +        }
> +    }
> +
> +    /// Get the command identifier for the request
> +    pub fn command(&self) -> u32 {
> +        unsafe { (*self.ptr).cmd_flags & ((1 << bindings::REQ_OP_BITS) - 1) }
> +    }
> +
> +    /// Call this to indicate to the kernel that the request has been issued by the driver
> +    pub fn start(&self) {
> +        unsafe { bindings::blk_mq_start_request(self.ptr) };
> +    }
> +
> +    /// Call this to indicate to the kernel that the request has been completed without errors
> +    // TODO: Consume rq so that we can't use it after ending it?
> +    pub fn end_ok(&self) {
> +        unsafe { bindings::blk_mq_end_request(self.ptr, bindings::BLK_STS_OK as _) };
> +    }
> +
> +    /// Call this to indicate to the kernel that the request completed with an error
> +    pub fn end_err(&self, err: Error) {
> +        unsafe { bindings::blk_mq_end_request(self.ptr, err.to_blk_status()) };
> +    }
> +
> +    /// Call this to indicate that the request completed with the status indicated by `status`
> +    pub fn end(&self, status: Result) {
> +        if let Err(e) = status {
> +            self.end_err(e);
> +        } else {
> +            self.end_ok();
> +        }
> +    }
> +
> +    /// Call this to schedule defered completion of the request
> +    // TODO: Consume rq so that we can't use it after completing it?
> +    pub fn complete(&self) {
> +        if !unsafe { bindings::blk_mq_complete_request_remote(self.ptr) } {
> +            T::complete(&unsafe { Self::from_ptr(self.ptr) });
> +        }
> +    }
> +
> +    /// Get the target sector for the request
> +    #[inline(always)]

Why is this `inline(always)`?

> +    pub fn sector(&self) -> usize {
> +        unsafe { (*self.ptr).__sector as usize }
> +    }
> +}
> diff --git a/rust/kernel/block/mq/tag_set.rs b/rust/kernel/block/mq/tag_set.rs
> new file mode 100644
> index 000000000000..d122db7f6d0e
> --- /dev/null
> +++ b/rust/kernel/block/mq/tag_set.rs
> @@ -0,0 +1,92 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +//! This module provides the `TagSet` struct to wrap the C `struct blk_mq_tag_set`.
> +//!
> +//! C header: [`include/linux/blk-mq.h`](../../include/linux/blk-mq.h)
> +
> +use crate::{
> +    bindings,
> +    block::mq::{operations::OperationsVtable, Operations},
> +    error::{Error, Result},
> +    sync::Arc,
> +    types::ForeignOwnable,
> +};
> +use core::{cell::UnsafeCell, convert::TryInto, marker::PhantomData};
> +
> +/// A wrapper for the C `struct blk_mq_tag_set`
> +pub struct TagSet<T: Operations> {
> +    inner: UnsafeCell<bindings::blk_mq_tag_set>,
> +    _p: PhantomData<T>,
> +}
> +
> +impl<T: Operations> TagSet<T> {
> +    /// Try to create a new tag set
> +    pub fn try_new(
> +        nr_hw_queues: u32,
> +        tagset_data: T::TagSetData,
> +        num_tags: u32,
> +        num_maps: u32,
> +    ) -> Result<Arc<Self>> {

Why force the users to use `Arc`?

> +        let tagset = Arc::try_new(Self {
> +            inner: UnsafeCell::new(bindings::blk_mq_tag_set::default()),
> +            _p: PhantomData,
> +        })?;
> +
> +        // SAFETY: We just allocated `tagset`, we know this is the only reference to it.
> +        let inner = unsafe { &mut *tagset.inner.get() };
> +
> +        inner.ops = unsafe { OperationsVtable::<T>::build() };
> +        inner.nr_hw_queues = nr_hw_queues;
> +        inner.timeout = 0; // 0 means default which is 30 * HZ in C
> +        inner.numa_node = bindings::NUMA_NO_NODE;
> +        inner.queue_depth = num_tags;
> +        inner.cmd_size = core::mem::size_of::<T::RequestData>().try_into()?;
> +        inner.flags = bindings::BLK_MQ_F_SHOULD_MERGE;
> +        inner.driver_data = tagset_data.into_foreign() as _;
> +        inner.nr_maps = num_maps;
> +
> +        // SAFETY: `inner` points to valid and initialised memory.
> +        let ret = unsafe { bindings::blk_mq_alloc_tag_set(inner) };
> +        if ret < 0 {
> +            // SAFETY: We created `driver_data` above with `into_foreign`
> +            unsafe { T::TagSetData::from_foreign(inner.driver_data) };
> +            return Err(Error::from_errno(ret));
> +        }
> +
> +        Ok(tagset)
> +    }
> +
> +    /// Return the pointer to the wrapped `struct blk_mq_tag_set`
> +    pub(crate) fn raw_tag_set(&self) -> *mut bindings::blk_mq_tag_set {
> +        self.inner.get()
> +    }
> +}
> +
> +impl<T: Operations> Drop for TagSet<T> {
> +    fn drop(&mut self) {
> +        let tagset_data = unsafe { (*self.inner.get()).driver_data };
> +
> +        // SAFETY: `inner` is valid and has been properly initialised during construction.
> +        unsafe { bindings::blk_mq_free_tag_set(self.inner.get()) };
> +
> +        // SAFETY: `tagset_data` was created by a call to
> +        // `ForeignOwnable::into_foreign` in `TagSet::try_new()`
> +        unsafe { T::TagSetData::from_foreign(tagset_data) };
> +    }
> +}
> +
> +/// A tag set reference. Used to control lifetime and prevent drop of TagSet references passed to
> +/// `Operations::map_queues()`
> +pub struct TagSetRef {
> +    ptr: *mut bindings::blk_mq_tag_set,
> +}
> +
> +impl TagSetRef {
> +    pub(crate) unsafe fn from_ptr(tagset: *mut bindings::blk_mq_tag_set) -> Self {
> +        Self { ptr: tagset }
> +    }
> +
> +    pub fn ptr(&self) -> *mut bindings::blk_mq_tag_set {
> +        self.ptr
> +    }
> +}

This is a **very** thin abstraction, why is it needed?

> diff --git a/rust/kernel/error.rs b/rust/kernel/error.rs
> index 5f4114b30b94..421fef677321 100644
> --- a/rust/kernel/error.rs
> +++ b/rust/kernel/error.rs
> @@ -107,6 +107,10 @@ impl Error {
>          self.0
>      }
> 
> +    pub(crate) fn to_blk_status(self) -> bindings::blk_status_t {
> +        unsafe { bindings::errno_to_blk_status(self.0) }
> +    }
> +
>      /// Returns the error encoded as a pointer.
>      #[allow(dead_code)]
>      pub(crate) fn to_ptr<T>(self) -> *mut T {
> diff --git a/rust/kernel/lib.rs b/rust/kernel/lib.rs
> index 8bef6686504b..cd798d12d97c 100644
> --- a/rust/kernel/lib.rs
> +++ b/rust/kernel/lib.rs
> @@ -34,6 +34,7 @@ extern crate self as kernel;
>  #[cfg(not(test))]
>  #[cfg(not(testlib))]
>  mod allocator;
> +pub mod block;
>  mod build_assert;
>  pub mod error;
>  pub mod init;
> --
> 2.40.0
> 

--
Cheers,
Benno

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC PATCH 04/11] rust: block: introduce `kernel::block::bio` module
  2023-05-03  9:07 ` [RFC PATCH 04/11] rust: block: introduce `kernel::block::bio` module Andreas Hindborg
@ 2023-05-08 12:58   ` Benno Lossin
  2024-01-11 12:49     ` Andreas Hindborg (Samsung)
  0 siblings, 1 reply; 69+ messages in thread
From: Benno Lossin @ 2023-05-08 12:58 UTC (permalink / raw)
  To: Andreas Hindborg
  Cc: Jens Axboe, Christoph Hellwig, Keith Busch, Damien Le Moal,
	Hannes Reinecke, lsf-pc, rust-for-linux, linux-block,
	Andreas Hindborg, Matthew Wilcox, Miguel Ojeda, Alex Gaynor,
	Wedson Almeida Filho, Boqun Feng, Gary Guo, Björn Roy Baron,
	linux-kernel, gost.dev

On Wednesday, May 3rd, 2023 at 11:07, Andreas Hindborg <nmi@metaspace.dk> wrote:
> From: Andreas Hindborg <a.hindborg@samsung.com>
> 
> Add abstractions for working with `struct bio`.
> 
> Signed-off-by: Andreas Hindborg <a.hindborg@samsung.com>
> ---
>  rust/kernel/block.rs            |   1 +
>  rust/kernel/block/bio.rs        |  93 ++++++++++++++++
>  rust/kernel/block/bio/vec.rs    | 181 ++++++++++++++++++++++++++++++++
>  rust/kernel/block/mq/request.rs |  16 +++
>  4 files changed, 291 insertions(+)
>  create mode 100644 rust/kernel/block/bio.rs
>  create mode 100644 rust/kernel/block/bio/vec.rs
> 
> diff --git a/rust/kernel/block.rs b/rust/kernel/block.rs
> index 4c93317a568a..1797859551fd 100644
> --- a/rust/kernel/block.rs
> +++ b/rust/kernel/block.rs
> @@ -2,4 +2,5 @@
> 
>  //! Types for working with the block layer
> 
> +pub mod bio;
>  pub mod mq;
> diff --git a/rust/kernel/block/bio.rs b/rust/kernel/block/bio.rs
> new file mode 100644
> index 000000000000..6e93e4420105
> --- /dev/null
> +++ b/rust/kernel/block/bio.rs
> @@ -0,0 +1,93 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +//! Types for working with the bio layer.
> +//!
> +//! C header: [`include/linux/blk_types.h`](../../include/linux/blk_types.h)
> +
> +use core::fmt;
> +use core::ptr::NonNull;
> +
> +mod vec;
> +
> +pub use vec::BioSegmentIterator;
> +pub use vec::Segment;
> +
> +/// A wrapper around a `struct bio` pointer
> +///
> +/// # Invariants
> +///
> +/// First field must alwyas be a valid pointer to a valid `struct bio`.
> +pub struct Bio<'a>(
> +    NonNull<crate::bindings::bio>,
> +    core::marker::PhantomData<&'a ()>,

Please make this a struct with named fields. Also this is rather a
`BioRef`, right? Why can't `&Bio` be used and `Bio` embeds
`bindings::bio`?

> +);
> +
> +impl<'a> Bio<'a> {
> +    /// Returns an iterator over segments in this `Bio`. Does not consider
> +    /// segments of other bios in this bio chain.
> +    #[inline(always)]

Why are these `inline(always)`? The compiler should inline them
automatically?

> +    pub fn segment_iter(&'a self) -> BioSegmentIterator<'a> {
> +        BioSegmentIterator::new(self)
> +    }
> +
> +    /// Get a pointer to the `bio_vec` off this bio
> +    #[inline(always)]
> +    fn io_vec(&self) -> *const bindings::bio_vec {
> +        // SAFETY: By type invariant, get_raw() returns a valid pointer to a
> +        // valid `struct bio`
> +        unsafe { (*self.get_raw()).bi_io_vec }
> +    }
> +
> +    /// Return a copy of the `bvec_iter` for this `Bio`
> +    #[inline(always)]
> +    fn iter(&self) -> bindings::bvec_iter {

Why does this return the bindings iter? Maybe rename to `raw_iter`?

> +        // SAFETY: self.0 is always a valid pointer
> +        unsafe { (*self.get_raw()).bi_iter }
> +    }
> +
> +    /// Get the next `Bio` in the chain
> +    #[inline(always)]
> +    fn next(&self) -> Option<Bio<'a>> {
> +        // SAFETY: self.0 is always a valid pointer
> +        let next = unsafe { (*self.get_raw()).bi_next };
> +        Some(Self(NonNull::new(next)?, core::marker::PhantomData))

Missing `INVARIANT` explaining why `next` is valid or null. Also why not
use `Self::from_raw` here?

> +    }
> +
> +    /// Return the raw pointer of the wrapped `struct bio`
> +    #[inline(always)]
> +    fn get_raw(&self) -> *const bindings::bio {
> +        self.0.as_ptr()
> +    }
> +
> +    /// Create an instance of `Bio` from a raw pointer. Does check that the
> +    /// pointer is not null.
> +    #[inline(always)]
> +    pub(crate) unsafe fn from_raw(ptr: *mut bindings::bio) -> Option<Bio<'a>> {
> +        Some(Self(NonNull::new(ptr)?, core::marker::PhantomData))
> +    }
> +}
> +
> +impl core::fmt::Display for Bio<'_> {
> +    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
> +        write!(f, "Bio {:?}", self.0.as_ptr())

this will display as `Bio 0x7ff0654..` I think there should be some
symbol wrapping the pointer like `Bio(0x7ff0654)` or `Bio { ptr: 0x7ff0654 }`.

> +    }
> +}
> +
> +/// An iterator over `Bio`
> +pub struct BioIterator<'a> {
> +    pub(crate) bio: Option<Bio<'a>>,
> +}
> +
> +impl<'a> core::iter::Iterator for BioIterator<'a> {
> +    type Item = Bio<'a>;
> +
> +    #[inline(always)]
> +    fn next(&mut self) -> Option<Bio<'a>> {
> +        if let Some(current) = self.bio.take() {
> +            self.bio = current.next();
> +            Some(current)
> +        } else {
> +            None
> +        }

Can be rewritten as:
    let current = self.bio.take()?;
    self.bio = current.next();
    Some(cur)

> +    }
> +}
> diff --git a/rust/kernel/block/bio/vec.rs b/rust/kernel/block/bio/vec.rs
> new file mode 100644
> index 000000000000..acd328a6fe54
> --- /dev/null
> +++ b/rust/kernel/block/bio/vec.rs
> @@ -0,0 +1,181 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +//! Types for working with `struct bio_vec` IO vectors
> +//!
> +//! C header: [`include/linux/bvec.h`](../../include/linux/bvec.h)
> +
> +use super::Bio;
> +use crate::error::Result;
> +use crate::pages::Pages;
> +use core::fmt;
> +use core::mem::ManuallyDrop;
> +
> +#[inline(always)]
> +fn mp_bvec_iter_offset(bvec: *const bindings::bio_vec, iter: &bindings::bvec_iter) -> u32 {
> +    (unsafe { (*bvec_iter_bvec(bvec, iter)).bv_offset }) + iter.bi_bvec_done
> +}
> +
> +#[inline(always)]
> +fn mp_bvec_iter_page(
> +    bvec: *const bindings::bio_vec,
> +    iter: &bindings::bvec_iter,
> +) -> *mut bindings::page {
> +    unsafe { (*bvec_iter_bvec(bvec, iter)).bv_page }
> +}
> +
> +#[inline(always)]
> +fn mp_bvec_iter_page_idx(bvec: *const bindings::bio_vec, iter: &bindings::bvec_iter) -> usize {
> +    (mp_bvec_iter_offset(bvec, iter) / crate::PAGE_SIZE) as usize
> +}
> +
> +#[inline(always)]
> +fn mp_bvec_iter_len(bvec: *const bindings::bio_vec, iter: &bindings::bvec_iter) -> u32 {
> +    iter.bi_size
> +        .min(unsafe { (*bvec_iter_bvec(bvec, iter)).bv_len } - iter.bi_bvec_done)
> +}
> +
> +#[inline(always)]
> +fn bvec_iter_bvec(
> +    bvec: *const bindings::bio_vec,
> +    iter: &bindings::bvec_iter,
> +) -> *const bindings::bio_vec {
> +    unsafe { bvec.add(iter.bi_idx as usize) }
> +}
> +
> +#[inline(always)]
> +fn bvec_iter_page(
> +    bvec: *const bindings::bio_vec,
> +    iter: &bindings::bvec_iter,
> +) -> *mut bindings::page {
> +    unsafe { mp_bvec_iter_page(bvec, iter).add(mp_bvec_iter_page_idx(bvec, iter)) }
> +}
> +
> +#[inline(always)]
> +fn bvec_iter_len(bvec: *const bindings::bio_vec, iter: &bindings::bvec_iter) -> u32 {
> +    mp_bvec_iter_len(bvec, iter).min(crate::PAGE_SIZE - bvec_iter_offset(bvec, iter))
> +}
> +
> +#[inline(always)]
> +fn bvec_iter_offset(bvec: *const bindings::bio_vec, iter: &bindings::bvec_iter) -> u32 {
> +    mp_bvec_iter_offset(bvec, iter) % crate::PAGE_SIZE
> +}

Why are these functions:
- not marked as `unsafe`?
- undocumented,
- free functions.

Can't these be directly implemented on `Segment<'_>`? If not,
I think we should find some better way or make them `unsafe`.

> +
> +/// A wrapper around a `strutct bio_vec` - a contiguous range of physical memory addresses
> +///
> +/// # Invariants
> +///
> +/// `bio_vec` must always be initialized and valid
> +pub struct Segment<'a> {
> +    bio_vec: bindings::bio_vec,
> +    _marker: core::marker::PhantomData<&'a ()>,
> +}
> +
> +impl Segment<'_> {
> +    /// Get he lenght of the segment in bytes
> +    #[inline(always)]
> +    pub fn len(&self) -> usize {
> +        self.bio_vec.bv_len as usize
> +    }
> +
> +    /// Returns true if the length of the segment is 0
> +    #[inline(always)]
> +    pub fn is_empty(&self) -> bool {
> +        self.len() == 0
> +    }
> +
> +    /// Get the offset field of the `bio_vec`
> +    #[inline(always)]
> +    pub fn offset(&self) -> usize {
> +        self.bio_vec.bv_offset as usize
> +    }
> +
> +    /// Copy data of this segment into `page`.
> +    #[inline(always)]
> +    pub fn copy_to_page_atomic(&self, page: &mut Pages<0>) -> Result {
> +        // SAFETY: self.bio_vec is valid and thus bv_page must be a valid
> +        // pointer to a `struct page`. We do not own the page, but we prevent
> +        // drop by wrapping the `Pages` in `ManuallyDrop`.
> +        let our_page = ManuallyDrop::new(unsafe { Pages::<0>::from_raw(self.bio_vec.bv_page) });
> +        let our_map = our_page.kmap_atomic();
> +
> +        // TODO: Checck offset is within page - what guarantees does `bio_vec` provide?
> +        let ptr = unsafe { (our_map.get_ptr() as *const u8).add(self.offset()) };
> +
> +        unsafe { page.write_atomic(ptr, self.offset(), self.len()) }
> +    }
> +
> +    /// Copy data from `page` into this segment
> +    #[inline(always)]
> +    pub fn copy_from_page_atomic(&mut self, page: &Pages<0>) -> Result {
> +        // SAFETY: self.bio_vec is valid and thus bv_page must be a valid
> +        // pointer to a `struct page`. We do not own the page, but we prevent
> +        // drop by wrapping the `Pages` in `ManuallyDrop`.
> +        let our_page = ManuallyDrop::new(unsafe { Pages::<0>::from_raw(self.bio_vec.bv_page) });
> +        let our_map = our_page.kmap_atomic();
> +
> +        // TODO: Checck offset is within page
> +        let ptr = unsafe { (our_map.get_ptr() as *mut u8).add(self.offset()) };
> +
> +        unsafe { page.read_atomic(ptr, self.offset(), self.len()) }
> +    }
> +}
> +
> +impl core::fmt::Display for Segment<'_> {
> +    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
> +        write!(
> +            f,
> +            "Segment {:?} len: {}",
> +            self.bio_vec.bv_page, self.bio_vec.bv_len
> +        )
> +    }
> +}
> +
> +/// An iterator over `Segment`
> +pub struct BioSegmentIterator<'a> {
> +    bio: &'a Bio<'a>,
> +    iter: bindings::bvec_iter,
> +}
> +
> +impl<'a> BioSegmentIterator<'a> {
> +    #[inline(always)]
> +    pub(crate) fn new(bio: &'a Bio<'a>) -> BioSegmentIterator<'_> {
> +        Self {
> +            bio,
> +            iter: bio.iter(),
> +        }
> +    }
> +}
> +
> +impl<'a> core::iter::Iterator for BioSegmentIterator<'a> {
> +    type Item = Segment<'a>;
> +
> +    #[inline(always)]
> +    fn next(&mut self) -> Option<Self::Item> {
> +        if self.iter.bi_size == 0 {
> +            return None;
> +        }
> +
> +        // Macro
> +        // bio_vec = bio_iter_iovec(bio, self.iter)
> +        // bio_vec = bvec_iter_bvec(bio.bi_io_vec, self.iter);

Weird comment?

> +        let bio_vec_ret = bindings::bio_vec {
> +            bv_page: bvec_iter_page(self.bio.io_vec(), &self.iter),
> +            bv_len: bvec_iter_len(self.bio.io_vec(), &self.iter),
> +            bv_offset: bvec_iter_offset(self.bio.io_vec(), &self.iter),
> +        };
> +
> +        // Static C function
> +        unsafe {
> +            bindings::bio_advance_iter_single(
> +                self.bio.get_raw(),
> +                &mut self.iter as *mut bindings::bvec_iter,
> +                bio_vec_ret.bv_len,
> +            )
> +        };
> +
> +        Some(Segment {
> +            bio_vec: bio_vec_ret,
> +            _marker: core::marker::PhantomData,
> +        })
> +    }
> +}
> diff --git a/rust/kernel/block/mq/request.rs b/rust/kernel/block/mq/request.rs
> index e95ae3fd71ad..ccb1033b64b6 100644
> --- a/rust/kernel/block/mq/request.rs
> +++ b/rust/kernel/block/mq/request.rs
> @@ -11,6 +11,9 @@ use crate::{
>  };
>  use core::marker::PhantomData;
> 
> +use crate::block::bio::Bio;
> +use crate::block::bio::BioIterator;
> +
>  /// A wrapper around a blk-mq `struct request`. This represents an IO request.
>  pub struct Request<T: Operations> {
>      ptr: *mut bindings::request,
> @@ -63,6 +66,19 @@ impl<T: Operations> Request<T> {
>          }
>      }
> 
> +    /// Get a wrapper for the first Bio in this request
> +    #[inline(always)]
> +    pub fn bio(&self) -> Option<Bio<'_>> {
> +        let ptr = unsafe { (*self.ptr).bio };
> +        unsafe { Bio::from_raw(ptr) }
> +    }
> +
> +    /// Get an iterator over all bio structurs in this request
> +    #[inline(always)]
> +    pub fn bio_iter(&self) -> BioIterator<'_> {
> +        BioIterator { bio: self.bio() }
> +    }
> +
>      /// Get the target sector for the request
>      #[inline(always)]
>      pub fn sector(&self) -> usize {
> --
> 2.40.0
> 

--
Cheers,
Benno

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC PATCH 03/11] rust: block: introduce `kernel::block::mq` module
  2023-05-08 12:29   ` Benno Lossin
@ 2023-05-11  6:52     ` Sergio González Collado
  2024-01-23 14:03       ` Andreas Hindborg (Samsung)
  2024-01-12  9:18     ` Andreas Hindborg (Samsung)
  1 sibling, 1 reply; 69+ messages in thread
From: Sergio González Collado @ 2023-05-11  6:52 UTC (permalink / raw)
  To: Benno Lossin
  Cc: Andreas Hindborg, Jens Axboe, Christoph Hellwig, Keith Busch,
	Damien Le Moal, Hannes Reinecke, lsf-pc, rust-for-linux,
	linux-block, Andreas Hindborg, Matthew Wilcox, Miguel Ojeda,
	Alex Gaynor, Wedson Almeida Filho, Boqun Feng, Gary Guo,
	Björn Roy Baron, linux-kernel, gost.dev

+    /// Call to tell the block layer the capcacity of the device
+    pub fn set_capacity(&self, sectors: u64) {
+        unsafe { bindings::set_capacity(self.gendisk, sectors) };
+    }

Nit in the comment: capcacity -> capacity

Cheers,
 Sergio

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC PATCH 00/11] Rust null block driver
  2023-05-04 19:02         ` Jens Axboe
                             ` (2 preceding siblings ...)
  2023-05-05  3:52           ` Christoph Hellwig
@ 2023-06-06 13:33           ` Andreas Hindborg (Samsung)
  2023-06-06 14:46             ` Miguel Ojeda
  3 siblings, 1 reply; 69+ messages in thread
From: Andreas Hindborg (Samsung) @ 2023-06-06 13:33 UTC (permalink / raw)
  To: Jens Axboe, Keith Busch, Christoph Hellwig, Ming Lei
  Cc: Bart Van Assche, Damien Le Moal, Hannes Reinecke, rust-for-linux,
	linux-block, Matthew Wilcox, Miguel Ojeda, Alex Gaynor,
	Wedson Almeida Filho, Boqun Feng, Gary Guo, Björn Roy Baron,
	Benno Lossin, open list, gost.dev, Matias Bjørling,
	Niklas Cassel, Johannes Thumshirn


Hi All,

I apologize for the lengthy email, but I have a lot of things to cover.

As some of you know, a goal of mine is to make it possible to write blk-mq
device drivers in Rust. The RFC patches I have sent to this list are the first
steps of making that goal a reality. They are a sample of the work I am doing.

My current plan of action is to provide a Rust API that allows implementation of
blk-mq device drives, along with a Rust implementation of null_blk to serve as a
reference implementation. This reference implementation will demonstrate how to
use the API.

I attended LSF in Vancouver a few weeks back where I led a discussion on the
topic. My goal for that session was to obtain input from the community on how to
upstream the work as it becomes more mature.

I received a lot of feedback, both during the session, in the hallway, and on
the mailing list. Ultimately, we did not achieve consensus on a path forward. I
will try to condense the key points raised by the community here. If anyone feel
their point is not contained below, please chime in.

Please note that I am paraphrasing the points below, they are not citations.

1) "Block layer community does not speak Rust and thus cannot review Rust patches"

   This work hinges on one of two things happening. Either block layer reviewers
   and maintainers eventually becoming fluent in Rust, or they accept code in
   their tree that are maintained by the "rust people". I very much would prefer
   the first option.

   I would suggest to use this work to facilitate gradual adoption of Rust. I
   understand that this will be a multi-year effort. By giving the community
   access to a Rust bindings specifically designed or the block layer, the block
   layer community will have a helpful reference to consult when investigating
   Rust.

   While the block community is getting up to speed in Rust, the Rust for Linux
   community is ready to conduct review of patches targeting the block layer.
   Until such a time where Rust code can be reviewed by block layer experts, the
   work could be gated behind an "EXPERIMENTAL" flag.

   Selection of the null_blk driver for a reference implementation to drive the
   Rust block API was not random. The null_blk driver is relatively simple and
   thus makes for a good platform to demonstrate the Rust API without having to
   deal with actual hardware.

   The null_blk driver is a piece of testing infrastructure that is not usually
   deployed in production environments, so people who are worried about Rust in
   general will not have to worry about their production environments being
   infested with Rust.

   Finally there have been suggestions both to replace and/or complement the
   existing C null_blk driver with the Rust version. I would suggest
   (eventually, not _now_) complementing the existing driver, since it can be
   very useful to benchmark and test the two drivers side by side.

2) "Having Rust bindings for the block layer in-tree is a burden for the
   maintainers"

   I believe we can integrate the bindings in a way so that any potential
   breakage in the Rust API does not impact current maintenance work.
   Maintainers and reviewers that do not wish to bother with Rust should be able
   to opt out. All Rust parts should be gated behind a default N kconfig option.
   With this scheme there should be very little inconvenience for current
   maintainers.

   I will take necessary steps to make sure block layer Rust bindings are always
   up to date with changes to kernel C API. I would run CI against

   - for-next of https://git.kernel.dk/linux.git
   - master of https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
   - mainline releases including RCs
   - stable and longterm kernels with queues applied
   - stable and longterm releases including RCs

   Samsung will provide resources to support this CI effort. Through this effort
   I will aim to minimize any inconvenience for maintainers.

3) "How will you detect breakage in the Rust API caused by changes to C code?"

   The way we call C code from Rust in the kernel guarantees that most changes
   to C APIs that are called by Rust code will cause a compile failure when
   building the kernel with Rust enabled. This includes changing C function
   argument names or types, and struct field names or types. Thus, we do not need
   to rely on symvers CRC calculation as suggested by James Bottomley at LSF.

   However, if the semantics of a kernel C function is changed without changing
   its name or signature, potential breakage will not be detected by the build
   system. To detect breakage resulting from this kind of change, we have to
   rely _on the same mechanics_ that maintainers of kernel C code are relying on
   today:

   - kunit tests
   - blktests
   - fstests
   - staying in the loop wrt changes in general

   We also have Rust support in Intel 0-day CI, although only compile tests for
   now.

4) "How will you prevent breakage in C code resulting from changes to Rust code"

   The way the Rust API is designed, existing C code is not going to be reliant
   on Rust code. If anything breaks just disable Rust and no Rust code will be
   built. Or disable block layer Rust code if you want to keep general Rust
   support. If Rust is disabled by default, nothing in the kernel should break
   because of Rust, if not explicitly enabled.

5) "Block drivers in general are not security sensitive because they are mostly
   privileged code and have limited user visible API"

   There are probably easier ways to exploit a Linux system than to target the
   block layer, although people are plugging in potentially malicious block
   devices all the time in the form of USB Mass Storage devices or CF cards.

   While memory safety is very relevant for preventing exploitable security
   vulnerabilities, it is also incredibly useful in preventing memory safety
   bugs in general. Fewer bugs means less risk of bugs leading to data
   corruption. It means less time spent on tracking down and fixing bugs, and
   less time spent reviewing bug fixes. It also means less time required to
   review patches in general, because reviewers do not have to review for memory
   safety issues.

   So while Rust has high merit in exposed and historically exploited
   subsystems, this does not mean that it has no merit in other subsystems.

6) "Other subsystems may benefit more from adopting Rust"

   While this might be true, it does not prevent the block subsystem from
   benefiting from adopting Rust (see 5).


7) "Do not waste time re-implementing null_blk, it is test infrastructure so
   memory safety does not matter. Why don't you do loop instead?"

   I strongly believe that memory safety is also relevant in test
   infrastructure. We waste time and energy fixing memory safety issues in our
   code, no matter if the code is test infrastructure or not. I refer to the
   statistics I posted to the list at an earlier date [3].

   Further, I think it is a benefit to all if the storage community can become
   fluent in Rust before any critical infrastructure is deployed using Rust.
   This is one reason that I switched my efforts to null_block and that I am not
   pushing Rust NVMe.

8) "Why don't you wait with this work until you have a driver for a new storage
   standard"

   Let's be proactive. I think it is important to iron out the details of the
   Rust API before we implement any potential new driver. When we eventually
   need to implement a driver for a future storage standard, the choice to do so
   in Rust should be easy. By making the API available ahead of time, we will be
   able to provide future developers with a stable implementation to choose
   from.

9) "You are a new face in our community. How do we know you will not disappear?"

   I recognize this consideration and I acknowledge that the community is trust
   based. Trust takes time to build. I can do little more than state that I
   intend to stay with my team at Samsung to take care of this project for many
   years to come. Samsung is behind this particular effort. In general Google
   and Microsoft are actively contributing to the wider Rust for Linux project.
   Perhaps that can be an indication that the project in general is not going
   away.

10) "How can I learn how to build the kernel with Rust enabled?"

    We have a guide in `Documentation/rust/quick-start.rst`. If that guide does
    not get you started, please reach out to us [1] and we will help you get
    started (and fix the documentation since it must not be good enough then).

11) "What if something catches fire and you are out of office?"

    If I am for some reason not responding to pings during a merge, please
    contact the Rust subsystem maintainer and the Rust for Linux list [2]. There
    are quite a few people capable of firefighting if it should ever become
    necessary.

12) "These patches are not ready yet, we should not accept them"

    They most definitely are _not_ ready, and I would not ask for them to be
    included at all in their current state. The RFC is meant to give a sample of
    the work that I am doing and to start this conversation. I would rather have
    this conversation preemptively. I did not intend to give the impression that
    the patches are in a finalized state at all.


With all this in mind I would suggest that we treat the Rust block layer API and
associated null block driver as an experiment. I would suggest that we merge it
in when it is ready, and we gate it behind an experimental kconfig option. If it
turns out that all your worst nightmares come true and it becomes an unbearable
load for maintainers, reviewers and contributors, it will be low effort remove
it again. I very much doubt this will be the case though.

Jens, Kieth, Christoph, Ming, I would kindly ask you to comment on my suggestion
for next steps, or perhaps suggest an alternate path. In general I would
appreciate any constructive feedback from the community.

[1] https://rust-for-linux.com/contact
[2] rust-for-linux@vger.kernel.org
[3] https://lore.kernel.org/all/87y1ofj5tt.fsf@metaspace.dk/

Best regards,
Andreas Hindborg

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC PATCH 00/11] Rust null block driver
  2023-06-06 13:33           ` Andreas Hindborg (Samsung)
@ 2023-06-06 14:46             ` Miguel Ojeda
  0 siblings, 0 replies; 69+ messages in thread
From: Miguel Ojeda @ 2023-06-06 14:46 UTC (permalink / raw)
  To: Andreas Hindborg (Samsung)
  Cc: Jens Axboe, Keith Busch, Christoph Hellwig, Ming Lei,
	Bart Van Assche, Damien Le Moal, Hannes Reinecke, rust-for-linux,
	linux-block, Matthew Wilcox, Miguel Ojeda, Alex Gaynor,
	Wedson Almeida Filho, Boqun Feng, Gary Guo, Björn Roy Baron,
	Benno Lossin, open list, gost.dev, Matias Bjørling,
	Niklas Cassel, Johannes Thumshirn, Guillaume Tucker

On Tue, Jun 6, 2023 at 3:40 PM Andreas Hindborg (Samsung)
<nmi@metaspace.dk> wrote:
>
>    Samsung will provide resources to support this CI effort. Through this effort
>    I will aim to minimize any inconvenience for maintainers.

This is great.

>    We also have Rust support in Intel 0-day CI, although only compile tests for
>    now.

In addition, we also have initial Rust support in KernelCI (including
`linux-next` [1], `rust-next` and the old `rust` branch), which could
be expanded upon, especially with more resources (Cc'ing Guillaume).

Moreover, our plan is to replicate the simple CI we originally had for
the `rust` branch (which included QEMU boot tests under a matrix of
several archs/configs/compilers) to our `rust-next`, `rust-fixes` and
`rust-dev` branches, in order to complement the other CIs and to get
some early smoke testing (especially for `rust-dev`).

[1] https://github.com/kernelci/kernelci-core/blob/65b0900438e0ed20e7efe0ada681ab212dc8c774/config/core/build-configs.yaml#L1152-L1197

Cheers,
Miguel

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC PATCH 00/11] Rust null block driver
  2023-05-03  9:06 [RFC PATCH 00/11] Rust null block driver Andreas Hindborg
                   ` (13 preceding siblings ...)
  2023-05-07 23:31 ` Luis Chamberlain
@ 2023-07-27  3:45 ` Yexuan Yang
  2023-07-27  3:47 ` Yexuan Yang
       [not found] ` <2B3CA5F1CCCFEAB2+20230727034517.GB126117@1182282462>
  16 siblings, 0 replies; 69+ messages in thread
From: Yexuan Yang @ 2023-07-27  3:45 UTC (permalink / raw)
  To: 1182282462
  Cc: Jens Axboe, Christoph Hellwig, Keith Busch, Damien Le Moal,
	Hannes Reinecke, lsf-pc, rust-for-linux, linux-block,
	Andreas Hindborg, Matthew Wilcox, Miguel Ojeda, Alex Gaynor,
	Wedson Almeida Filho, Boqun Feng, Gary Guo, Bj??rn Roy Baron,
	Benno Lossin, open list, gost.dev

> Over the 432 benchmark configurations, the relative performance of the Rust
> driver to the C driver (in terms of IOPS) is between 6.8 and -11.8 percent with
> an average of 0.2 percent better for reads. For writes the span is 16.8 to -4.5
> percent with an average of 0.9 percent better.
> 
> For each measurement the drivers are loaded, a drive is configured with memory
> backing and a size of 4 GiB. C null_blk is configured to match the implemented
> modes of the Rust driver: `blocksize` is set to 4 KiB, `completion_nsec` to 0,
> `irqmode` to 0 (IRQ_NONE), `queue_mode` to 2 (MQ), `hw_queue_depth` to 256 and
> `memory_backed` to 1. For both the drivers, the queue scheduler is set to
> `none`. These measurements are made using 30 second runs of `fio` with the
> `PSYNC` IO engine with workers pinned to separate CPU cores. The measurements
> are done inside a virtual machine (qemu/kvm) on an Intel Alder Lake workstation
> (i5-12600).

Hi All!
I have some problems about your benchmark test.
In Ubuntu 22.02, I compiled an RFL kernel with both C null_blk driver and Rust null_blk_driver as modules. For the C null_blk driver, I used the `modprobe` command to set relevant parameters, while for the Rust null_blk_driver, I simply started it. I used the following two commands to start the drivers:

```bash
sudo modprobe null_blk \
    queue_mode=2 irqmode=0 hw_queue_depth=256 \
    memory_backed=1 bs=4096 completion_nsec=0 \
    no_sched=1 gb=4;
sudo modprobe rnull_mod
```

After that, I tested their performance in `randread` with the fio command, specifying the first parameter as 4 and the second parameter as 1:

```bash
fio --filename=/dev/nullb0 --iodepth=64 --ioengine=psync --direct=1 --rw=randread --bs=$1k --numjobs=$2 --runtime=30 --group_reporting --name=test-rand-read --output=test_c_randread.log
fio --filename=/dev/rnullb0 --iodepth=64 --ioengine=psync --direct=1 --rw=randread --bs=$1k --numjobs=$2 --runtime=30 --group_reporting --name=test-rand-read --output=test_rust_randread.log
```

The test results showed a significant performance difference between the two drivers, which was drastically different from the data you tested in the community. Specifically, for `randread`, the C driver had a bandwidth of 487MiB/s and IOPS of 124.7k, while the Rust driver had a bandwidth of 264MiB/s and IOPS of 67.7k. However, for other I/O types, the performance of the C and Rust drivers was more similar. Could you please provide more information about the actual bandwidth and IOPS data from your tests, rather than just the performance difference between the C and Rust drivers? Additionally, if you could offer possible reasons for this abnormality, I would greatly appreciate it!

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC PATCH 00/11] Rust null block driver
  2023-05-03  9:06 [RFC PATCH 00/11] Rust null block driver Andreas Hindborg
                   ` (14 preceding siblings ...)
  2023-07-27  3:45 ` Yexuan Yang
@ 2023-07-27  3:47 ` Yexuan Yang
       [not found] ` <2B3CA5F1CCCFEAB2+20230727034517.GB126117@1182282462>
  16 siblings, 0 replies; 69+ messages in thread
From: Yexuan Yang @ 2023-07-27  3:47 UTC (permalink / raw)
  To: Andreas Hindborg
  Cc: Jens Axboe, Christoph Hellwig, Keith Busch, Damien Le Moal,
	Hannes Reinecke, lsf-pc, rust-for-linux, linux-block,
	Andreas Hindborg, Matthew Wilcox, Miguel Ojeda, Alex Gaynor,
	Wedson Almeida Filho, Boqun Feng, Gary Guo, Bj??rn Roy Baron,
	Benno Lossin, open list, gost.dev, 1182282462

> In this table each cell shows the relative performance of the Rust
> driver to the C driver with the throughput of the C driver in parenthesis:
> `rel_read rel_write (read_miops write_miops)`. Ex: the Rust driver performs 4.74
> percent better than the C driver for fio randread with 2 jobs at 16 KiB
> block size.
> 
> Over the 432 benchmark configurations, the relative performance of the Rust
> driver to the C driver (in terms of IOPS) is between 6.8 and -11.8 percent with
> an average of 0.2 percent better for reads. For writes the span is 16.8 to -4.5
> percent with an average of 0.9 percent better.
> 
> For each measurement the drivers are loaded, a drive is configured with memory
> backing and a size of 4 GiB. C null_blk is configured to match the implemented
> modes of the Rust driver: `blocksize` is set to 4 KiB, `completion_nsec` to 0,
> `irqmode` to 0 (IRQ_NONE), `queue_mode` to 2 (MQ), `hw_queue_depth` to 256 and
> `memory_backed` to 1. For both the drivers, the queue scheduler is set to
> `none`. These measurements are made using 30 second runs of `fio` with the
> `PSYNC` IO engine with workers pinned to separate CPU cores. The measurements
> are done inside a virtual machine (qemu/kvm) on an Intel Alder Lake workstation
> (i5-12600).
Hi All!
I have some problems about your benchmark test.
In Ubuntu 22.02, I compiled an RFL kernel with both C null_blk driver and Rust null_blk_driver as modules. For the C null_blk driver, I used the `modprobe` command to set relevant parameters, while for the Rust null_blk_driver, I simply started it. I used the following two commands to start the drivers:

```bash
sudo modprobe null_blk \
    queue_mode=2 irqmode=0 hw_queue_depth=256 \
    memory_backed=1 bs=4096 completion_nsec=0 \
    no_sched=1 gb=4;
sudo modprobe rnull_mod
```

After that, I tested their performance in `randread` with the fio command, specifying the first parameter as 4 and the second parameter as 1:

```bash
fio --filename=/dev/nullb0 --iodepth=64 --ioengine=psync --direct=1 --rw=randread --bs=$1k --numjobs=$2 --runtime=30 --group_reporting --name=test-rand-read --output=test_c_randread.log
fio --filename=/dev/rnullb0 --iodepth=64 --ioengine=psync --direct=1 --rw=randread --bs=$1k --numjobs=$2 --runtime=30 --group_reporting --name=test-rand-read --output=test_rust_randread.log
```

The test results showed a significant performance difference between the two drivers, which was drastically different from the data you tested in the community. Specifically, for `randread`, the C driver had a bandwidth of 487MiB/s and IOPS of 124.7k, while the Rust driver had a bandwidth of 264MiB/s and IOPS of 67.7k. However, for other I/O types, the performance of the C and Rust drivers was more similar. Could you please provide more information about the actual bandwidth and IOPS data from your tests, rather than just the performance difference between the C and Rust drivers? Additionally, if you could offer possible reasons for this abnormality, I would greatly appreciate it!

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC PATCH 00/11] Rust null block driver
       [not found] ` <2B3CA5F1CCCFEAB2+20230727034517.GB126117@1182282462>
@ 2023-07-28  6:49   ` Andreas Hindborg (Samsung)
  2023-07-31 14:14     ` Andreas Hindborg (Samsung)
  0 siblings, 1 reply; 69+ messages in thread
From: Andreas Hindborg (Samsung) @ 2023-07-28  6:49 UTC (permalink / raw)
  To: Yexuan Yang
  Cc: Jens Axboe, Christoph Hellwig, Keith Busch, Damien Le Moal,
	Hannes Reinecke, lsf-pc, rust-for-linux, linux-block,
	Matthew Wilcox, Miguel Ojeda, Alex Gaynor, Wedson Almeida Filho,
	Boqun Feng, Gary Guo, Bj??rn Roy Baron, Benno Lossin, open list,
	gost.dev


Hi,

Yexuan Yang <1182282462@bupt.edu.cn> writes:

>> Over the 432 benchmark configurations, the relative performance of the Rust
>> driver to the C driver (in terms of IOPS) is between 6.8 and -11.8 percent with
>> an average of 0.2 percent better for reads. For writes the span is 16.8 to -4.5
>> percent with an average of 0.9 percent better.
>> 
>> For each measurement the drivers are loaded, a drive is configured with memory
>> backing and a size of 4 GiB. C null_blk is configured to match the implemented
>> modes of the Rust driver: `blocksize` is set to 4 KiB, `completion_nsec` to 0,
>> `irqmode` to 0 (IRQ_NONE), `queue_mode` to 2 (MQ), `hw_queue_depth` to 256 and
>> `memory_backed` to 1. For both the drivers, the queue scheduler is set to
>> `none`. These measurements are made using 30 second runs of `fio` with the
>> `PSYNC` IO engine with workers pinned to separate CPU cores. The measurements
>> are done inside a virtual machine (qemu/kvm) on an Intel Alder Lake workstation
>> (i5-12600).
>
> Hi All!
> I have some problems about your benchmark test.
> In Ubuntu 22.02, I compiled an RFL kernel with both C null_blk driver and Rust
> null_blk_driver as modules. For the C null_blk driver, I used the `modprobe`
> command to set relevant parameters, while for the Rust null_blk_driver, I simply
> started it. I used the following two commands to start the drivers:
>
> ```bash
> sudo modprobe null_blk \
>     queue_mode=2 irqmode=0 hw_queue_depth=256 \
>     memory_backed=1 bs=4096 completion_nsec=0 \
>     no_sched=1 gb=4;
> sudo modprobe rnull_mod
> ```
>
> After that, I tested their performance in `randread` with the fio command, specifying the first parameter as 4 and the second parameter as 1:
>
> ```bash
> fio --filename=/dev/nullb0 --iodepth=64 --ioengine=psync --direct=1 --rw=randread --bs=$1k --numjobs=$2 --runtime=30 --group_reporting --name=test-rand-read --output=test_c_randread.log
> fio --filename=/dev/rnullb0 --iodepth=64 --ioengine=psync --direct=1 --rw=randread --bs=$1k --numjobs=$2 --runtime=30 --group_reporting --name=test-rand-read --output=test_rust_randread.log
> ```
>
> The test results showed a significant performance difference between the two
> drivers, which was drastically different from the data you tested in the
> community. Specifically, for `randread`, the C driver had a bandwidth of
> 487MiB/s and IOPS of 124.7k, while the Rust driver had a bandwidth of 264MiB/s
> and IOPS of 67.7k. However, for other I/O types, the performance of the C and
> Rust drivers was more similar. Could you please provide more information about
> the actual bandwidth and IOPS data from your tests, rather than just the
> performance difference between the C and Rust drivers? Additionally, if you
> could offer possible reasons for this abnormality, I would greatly appreciate
> it!

Thanks for trying out the code! I am not sure why you get these numbers.
I am currently out of office, but I will rerun the benchmarks next week
when I get back in. Maybe I can provide you with some scripts and my
kernel configuration. Hopefully we can figure out the difference in our
setups.

Best regards,
Andreas

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC PATCH 00/11] Rust null block driver
  2023-07-28  6:49   ` Andreas Hindborg (Samsung)
@ 2023-07-31 14:14     ` Andreas Hindborg (Samsung)
  0 siblings, 0 replies; 69+ messages in thread
From: Andreas Hindborg (Samsung) @ 2023-07-31 14:14 UTC (permalink / raw)
  To: Andreas Hindborg (Samsung)
  Cc: Yexuan Yang, Jens Axboe, Christoph Hellwig, Keith Busch,
	Damien Le Moal, Hannes Reinecke, lsf-pc, rust-for-linux,
	linux-block, Matthew Wilcox, Miguel Ojeda, Alex Gaynor,
	Wedson Almeida Filho, Boqun Feng, Gary Guo, Bj??rn Roy Baron,
	Benno Lossin, open list, gost.dev


Hi,

"Andreas Hindborg (Samsung)" <nmi@metaspace.dk> writes:

> Hi,
>
> Yexuan Yang <1182282462@bupt.edu.cn> writes:
>
>>> Over the 432 benchmark configurations, the relative performance of the Rust
>>> driver to the C driver (in terms of IOPS) is between 6.8 and -11.8 percent with
>>> an average of 0.2 percent better for reads. For writes the span is 16.8 to -4.5
>>> percent with an average of 0.9 percent better.
>>> 
>>> For each measurement the drivers are loaded, a drive is configured with memory
>>> backing and a size of 4 GiB. C null_blk is configured to match the implemented
>>> modes of the Rust driver: `blocksize` is set to 4 KiB, `completion_nsec` to 0,
>>> `irqmode` to 0 (IRQ_NONE), `queue_mode` to 2 (MQ), `hw_queue_depth` to 256 and
>>> `memory_backed` to 1. For both the drivers, the queue scheduler is set to
>>> `none`. These measurements are made using 30 second runs of `fio` with the
>>> `PSYNC` IO engine with workers pinned to separate CPU cores. The measurements
>>> are done inside a virtual machine (qemu/kvm) on an Intel Alder Lake workstation
>>> (i5-12600).
>>
>> Hi All!
>> I have some problems about your benchmark test.
>> In Ubuntu 22.02, I compiled an RFL kernel with both C null_blk driver and Rust
>> null_blk_driver as modules. For the C null_blk driver, I used the `modprobe`
>> command to set relevant parameters, while for the Rust null_blk_driver, I simply
>> started it. I used the following two commands to start the drivers:
>>
>> ```bash
>> sudo modprobe null_blk \
>>     queue_mode=2 irqmode=0 hw_queue_depth=256 \
>>     memory_backed=1 bs=4096 completion_nsec=0 \
>>     no_sched=1 gb=4;
>> sudo modprobe rnull_mod
>> ```
>>
>> After that, I tested their performance in `randread` with the fio command, specifying the first parameter as 4 and the second parameter as 1:
>>
>> ```bash
>> fio --filename=/dev/nullb0 --iodepth=64 --ioengine=psync --direct=1 --rw=randread --bs=$1k --numjobs=$2 --runtime=30 --group_reporting --name=test-rand-read --output=test_c_randread.log
>> fio --filename=/dev/rnullb0 --iodepth=64 --ioengine=psync --direct=1 --rw=randread --bs=$1k --numjobs=$2 --runtime=30 --group_reporting --name=test-rand-read --output=test_rust_randread.log
>> ```
>>
>> The test results showed a significant performance difference between the two
>> drivers, which was drastically different from the data you tested in the
>> community. Specifically, for `randread`, the C driver had a bandwidth of
>> 487MiB/s and IOPS of 124.7k, while the Rust driver had a bandwidth of 264MiB/s
>> and IOPS of 67.7k. However, for other I/O types, the performance of the C and
>> Rust drivers was more similar. Could you please provide more information about
>> the actual bandwidth and IOPS data from your tests, rather than just the
>> performance difference between the C and Rust drivers? Additionally, if you
>> could offer possible reasons for this abnormality, I would greatly appreciate
>> it!
>
> Thanks for trying out the code! I am not sure why you get these numbers.
> I am currently out of office, but I will rerun the benchmarks next week
> when I get back in. Maybe I can provide you with some scripts and my
> kernel configuration. Hopefully we can figure out the difference in our
> setups.
>

I just ran the benchmarks for that configuration again. My setup is an
Intel Alder Lake CPU (i5-12600) with Debian Bullseye user space running
a 6.2.12 host kernel with a kernel config based on the stock Debian
Bullseye (debian config + make olddefconfig). 32 GB of DDR5 4800MHz
memory. The benchmarks (and patched kernel) run inside QEMU KVM (from
Debian repos Debian 1:5.2+dfsg-11+deb11u2). I use the following qemu
command to boot (edited a bit for clarity):

"qemu-system-x86_64" "-nographic" "-enable-kvm" "-m" "16G" "-cpu" "host"
"-M" "q35" "-smp" "6" "-kernel" "vmlinux" "-append" "console=ttyS0
nokaslr rdinit=/sbin/init root=/dev/vda1 null_blk.nr_devices=0"

I use a Debian Bullseye user space inside the virtual machine.

I think I used the stock Bullseye kernel for the host when I did the
numbers for the cover letter, so that is different for this run.

Here are the results:

+---------+----------+------+---------------------+
| jobs/bs | workload | prep |          4          |
+---------+----------+------+---------------------+
|    4    | randread |  1   | 0.28 0.00 (1.7,0.0) |
|    4    | randread |  0   | 5.75 0.00 (6.8,0.0) |
+---------+----------+------+---------------------+

I used the following parameters for both tests:

Block size: 4k
Run time: 120s
Workload: randread
Jobs: 4

This is the `fio` command I used:

['fio', '--group_reporting', '--name=default', '--filename=/dev/<snip>', '--time_based=1', '--runtime=120', '--rw=randread', '--output=<snip>', '--output-format=json', '--numjobs=4', '--cpus_allowed=0-3', '--cpus_allowed_policy=split', '--ioengine=psync', '--bs=4k', '--direct=1', '--norandommap', '--random_generator=lfsr']

For the line in the table where `prep` is 1 I filled up the target
device with data before running the benchmark. I use the following `fio`
job to do that:

['fio', '--name=prep', '--rw=write', '--filename=/dev/None', '--bs=4k', '--direct=1']

Did you prepare the volumes with data in your experiment?

How to read the data in the table: For the benchmark where the drive is
prepared, Rust null_blk perform 0.28 percent better than C null_block.
For the case where the drive is empty Rust null_block performs 5.75
percent better than C null_block. I calculated the relative performance
as "(r_iops - c_iops) / c_iops * 100".

The IOPS info you request is present in the tables from the cover
letter. As mentioned in the cover letter, the numbers in parenthesis in
each cell is throughput in 1,000,000 IOPS for read and
write for the C null_blk driver. For the table above, C null_blk did
1,700,000 read IOPS in the case when the drive is prepared and 5,750,000
IOPS in the case where the drive is not prepared.

You can obtain bandwidth by multiplying IOPS by block size. I can also
provide the raw json output of `fio` if you want to have a look.

One last note is that I unload the module between each run and I do not
have both modules loaded at the same time. If you are exhausting you
memory pool this could maybe impact performance?

So things to check:
 - Do you prepare the drive?
 - Do you release drive memory after test (you can unload the module)
 - Use the lfsr random number generator
 - Do not use the randommap

I am not sure what hardware you are running on, but the throughput
numbers you obtain seem _really_ low.

Best regards
Andreas Hindborg

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC PATCH 04/11] rust: block: introduce `kernel::block::bio` module
  2023-05-08 12:58   ` Benno Lossin
@ 2024-01-11 12:49     ` Andreas Hindborg (Samsung)
  2024-02-28 14:31       ` Andreas Hindborg
  0 siblings, 1 reply; 69+ messages in thread
From: Andreas Hindborg (Samsung) @ 2024-01-11 12:49 UTC (permalink / raw)
  To: Benno Lossin
  Cc: Jens Axboe, Christoph Hellwig, Keith Busch, Damien Le Moal,
	Hannes Reinecke, lsf-pc, rust-for-linux, linux-block,
	Matthew Wilcox, Miguel Ojeda, Alex Gaynor, Wedson Almeida Filho,
	Boqun Feng, Gary Guo, Björn Roy Baron, linux-kernel,
	gost.dev


Benno Lossin <benno.lossin@proton.me> writes:

>> +/// A wrapper around a `struct bio` pointer
>> +///
>> +/// # Invariants
>> +///
>> +/// First field must alwyas be a valid pointer to a valid `struct bio`.
>> +pub struct Bio<'a>(
>> +    NonNull<crate::bindings::bio>,
>> +    core::marker::PhantomData<&'a ()>,
>
> Please make this a struct with named fields. Also this is rather a
> `BioRef`, right? Why can't `&Bio` be used and `Bio` embeds
> `bindings::bio`?

Yes, it feels better with a &Bio reference, thanks.

>> +);
>> +
>> +impl<'a> Bio<'a> {
>> +    /// Returns an iterator over segments in this `Bio`. Does not consider
>> +    /// segments of other bios in this bio chain.
>> +    #[inline(always)]
>
> Why are these `inline(always)`? The compiler should inline them
> automatically?

No, the compiler would not inline into modules without them. I'll check
again if that is still the case.

>
>> +    pub fn segment_iter(&'a self) -> BioSegmentIterator<'a> {
>> +        BioSegmentIterator::new(self)
>> +    }
>> +
>> +    /// Get a pointer to the `bio_vec` off this bio
>> +    #[inline(always)]
>> +    fn io_vec(&self) -> *const bindings::bio_vec {
>> +        // SAFETY: By type invariant, get_raw() returns a valid pointer to a
>> +        // valid `struct bio`
>> +        unsafe { (*self.get_raw()).bi_io_vec }
>> +    }
>> +
>> +    /// Return a copy of the `bvec_iter` for this `Bio`
>> +    #[inline(always)]
>> +    fn iter(&self) -> bindings::bvec_iter {
>
> Why does this return the bindings iter? Maybe rename to `raw_iter`?

Makes sense to rename it, thanks.

>> +        // SAFETY: self.0 is always a valid pointer
>> +        unsafe { (*self.get_raw()).bi_iter }
>> +    }
>> +
>> +    /// Get the next `Bio` in the chain
>> +    #[inline(always)]
>> +    fn next(&self) -> Option<Bio<'a>> {
>> +        // SAFETY: self.0 is always a valid pointer
>> +        let next = unsafe { (*self.get_raw()).bi_next };
>> +        Some(Self(NonNull::new(next)?, core::marker::PhantomData))
>
> Missing `INVARIANT` explaining why `next` is valid or null. Also why not
> use `Self::from_raw` here?

Thanks, will change that.

>
>> +    }
>> +
>> +    /// Return the raw pointer of the wrapped `struct bio`
>> +    #[inline(always)]
>> +    fn get_raw(&self) -> *const bindings::bio {
>> +        self.0.as_ptr()
>> +    }
>> +
>> +    /// Create an instance of `Bio` from a raw pointer. Does check that the
>> +    /// pointer is not null.
>> +    #[inline(always)]
>> +    pub(crate) unsafe fn from_raw(ptr: *mut bindings::bio) -> Option<Bio<'a>> {
>> +        Some(Self(NonNull::new(ptr)?, core::marker::PhantomData))
>> +    }
>> +}
>> +
>> +impl core::fmt::Display for Bio<'_> {
>> +    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
>> +        write!(f, "Bio {:?}", self.0.as_ptr())
>
> this will display as `Bio 0x7ff0654..` I think there should be some
> symbol wrapping the pointer like `Bio(0x7ff0654)` or `Bio { ptr: 0x7ff0654 }`.
>

Sure 👍

>> +    }
>> +}
>> +
>> +/// An iterator over `Bio`
>> +pub struct BioIterator<'a> {
>> +    pub(crate) bio: Option<Bio<'a>>,
>> +}
>> +
>> +impl<'a> core::iter::Iterator for BioIterator<'a> {
>> +    type Item = Bio<'a>;
>> +
>> +    #[inline(always)]
>> +    fn next(&mut self) -> Option<Bio<'a>> {
>> +        if let Some(current) = self.bio.take() {
>> +            self.bio = current.next();
>> +            Some(current)
>> +        } else {
>> +            None
>> +        }
>
> Can be rewritten as:
>     let current = self.bio.take()?;
>     self.bio = current.next();
>     Some(cur)
>

Thanks 👍

>> +    }
>> +}
>> diff --git a/rust/kernel/block/bio/vec.rs b/rust/kernel/block/bio/vec.rs
>> new file mode 100644
>> index 000000000000..acd328a6fe54
>> --- /dev/null
>> +++ b/rust/kernel/block/bio/vec.rs
>> @@ -0,0 +1,181 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +
>> +//! Types for working with `struct bio_vec` IO vectors
>> +//!
>> +//! C header: [`include/linux/bvec.h`](../../include/linux/bvec.h)
>> +
>> +use super::Bio;
>> +use crate::error::Result;
>> +use crate::pages::Pages;
>> +use core::fmt;
>> +use core::mem::ManuallyDrop;
>> +
>> +#[inline(always)]
>> +fn mp_bvec_iter_offset(bvec: *const bindings::bio_vec, iter: &bindings::bvec_iter) -> u32 {
>> +    (unsafe { (*bvec_iter_bvec(bvec, iter)).bv_offset }) + iter.bi_bvec_done
>> +}
>> +
>> +#[inline(always)]
>> +fn mp_bvec_iter_page(
>> +    bvec: *const bindings::bio_vec,
>> +    iter: &bindings::bvec_iter,
>> +) -> *mut bindings::page {
>> +    unsafe { (*bvec_iter_bvec(bvec, iter)).bv_page }
>> +}
>> +
>> +#[inline(always)]
>> +fn mp_bvec_iter_page_idx(bvec: *const bindings::bio_vec, iter: &bindings::bvec_iter) -> usize {
>> +    (mp_bvec_iter_offset(bvec, iter) / crate::PAGE_SIZE) as usize
>> +}
>> +
>> +#[inline(always)]
>> +fn mp_bvec_iter_len(bvec: *const bindings::bio_vec, iter: &bindings::bvec_iter) -> u32 {
>> +    iter.bi_size
>> +        .min(unsafe { (*bvec_iter_bvec(bvec, iter)).bv_len } - iter.bi_bvec_done)
>> +}
>> +
>> +#[inline(always)]
>> +fn bvec_iter_bvec(
>> +    bvec: *const bindings::bio_vec,
>> +    iter: &bindings::bvec_iter,
>> +) -> *const bindings::bio_vec {
>> +    unsafe { bvec.add(iter.bi_idx as usize) }
>> +}
>> +
>> +#[inline(always)]
>> +fn bvec_iter_page(
>> +    bvec: *const bindings::bio_vec,
>> +    iter: &bindings::bvec_iter,
>> +) -> *mut bindings::page {
>> +    unsafe { mp_bvec_iter_page(bvec, iter).add(mp_bvec_iter_page_idx(bvec, iter)) }
>> +}
>> +
>> +#[inline(always)]
>> +fn bvec_iter_len(bvec: *const bindings::bio_vec, iter: &bindings::bvec_iter) -> u32 {
>> +    mp_bvec_iter_len(bvec, iter).min(crate::PAGE_SIZE - bvec_iter_offset(bvec, iter))
>> +}
>> +
>> +#[inline(always)]
>> +fn bvec_iter_offset(bvec: *const bindings::bio_vec, iter: &bindings::bvec_iter) -> u32 {
>> +    mp_bvec_iter_offset(bvec, iter) % crate::PAGE_SIZE
>> +}
>
> Why are these functions:
> - not marked as `unsafe`?
> - undocumented,
> - free functions.
>
> Can't these be directly implemented on `Segment<'_>`? If not,
> I think we should find some better way or make them `unsafe`.

Yes, you are right. I will move them to `Segment` and document them
better. They definitely need to be unsafe because they rely on C API
contract that the iterator is not indexing out of bounds.

[...]

>> +impl<'a> core::iter::Iterator for BioSegmentIterator<'a> {
>> +    type Item = Segment<'a>;
>> +
>> +    #[inline(always)]
>> +    fn next(&mut self) -> Option<Self::Item> {
>> +        if self.iter.bi_size == 0 {
>> +            return None;
>> +        }
>> +
>> +        // Macro
>> +        // bio_vec = bio_iter_iovec(bio, self.iter)
>> +        // bio_vec = bvec_iter_bvec(bio.bi_io_vec, self.iter);
>
> Weird comment?

Yes, will fix, thanks.

Thanks for the comments!

BR Andreas

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC PATCH 03/11] rust: block: introduce `kernel::block::mq` module
  2023-05-08 12:29   ` Benno Lossin
  2023-05-11  6:52     ` Sergio González Collado
@ 2024-01-12  9:18     ` Andreas Hindborg (Samsung)
  2024-01-23 16:14       ` Benno Lossin
  1 sibling, 1 reply; 69+ messages in thread
From: Andreas Hindborg (Samsung) @ 2024-01-12  9:18 UTC (permalink / raw)
  To: Benno Lossin
  Cc: Jens Axboe, Christoph Hellwig, Keith Busch, Damien Le Moal,
	Hannes Reinecke, lsf-pc, rust-for-linux, linux-block,
	Matthew Wilcox, Miguel Ojeda, Alex Gaynor, Wedson Almeida Filho,
	Boqun Feng, Gary Guo, Björn Roy Baron, linux-kernel,
	gost.dev


Hi Benno,

Benno Lossin <benno.lossin@proton.me> writes:

<...>

>> diff --git a/rust/kernel/block/mq/gen_disk.rs b/rust/kernel/block/mq/gen_disk.rs
>> new file mode 100644
>> index 000000000000..50496af15bbf
>> --- /dev/null
>> +++ b/rust/kernel/block/mq/gen_disk.rs
>> @@ -0,0 +1,133 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +
>> +//! GenDisk abstraction
>> +//!
>> +//! C header: [`include/linux/blkdev.h`](../../include/linux/blkdev.h)
>> +//! C header: [`include/linux/blk_mq.h`](../../include/linux/blk_mq.h)
>> +
>> +use crate::block::mq::{raw_writer::RawWriter, Operations, TagSet};
>> +use crate::{
>> +    bindings, error::from_err_ptr, error::Result, sync::Arc, types::ForeignOwnable,
>> +    types::ScopeGuard,
>> +};
>> +use core::fmt::{self, Write};
>> +
>> +/// A generic block device
>> +///
>> +/// # Invariants
>> +///
>> +///  - `gendisk` must always point to an initialized and valid `struct gendisk`.
>> +pub struct GenDisk<T: Operations> {
>> +    _tagset: Arc<TagSet<T>>,
>> +    gendisk: *mut bindings::gendisk,
>
> Why are these two fields not embedded? Shouldn't the user decide where
> to allocate?

The `TagSet` can be shared between multiple `GenDisk`. Using an `Arc`
seems resonable?

For the `gendisk` field, the allocation is done by C and the address
must be stable. We are owning the pointee and must drop it when it goes out
of scope. I could do this:

#[repr(transparent)]
struct GenDisk(Opaque<bindings::gendisk>);

struct UniqueGenDiskRef {
    _tagset: Arc<TagSet<T>>,
    gendisk: Pin<&'static mut GenDisk>,

}

but it seems pointless. `struct GenDisk` would not be pub in that case. What do you think?

>
>> +}
>> +
>> +// SAFETY: `GenDisk` is an owned pointer to a `struct gendisk` and an `Arc` to a
>> +// `TagSet` It is safe to send this to other threads as long as T is Send.
>> +unsafe impl<T: Operations + Send> Send for GenDisk<T> {}
>> +
>> +impl<T: Operations> GenDisk<T> {
>> +    /// Try to create a new `GenDisk`
>> +    pub fn try_new(tagset: Arc<TagSet<T>>, queue_data: T::QueueData) -> Result<Self> {
>> +        let data = queue_data.into_foreign();
>> +        let recover_data = ScopeGuard::new(|| {
>> +            // SAFETY: T::QueueData was created by the call to `into_foreign()` above
>> +            unsafe { T::QueueData::from_foreign(data) };
>> +        });
>> +
>> +        let lock_class_key = crate::sync::LockClassKey::new();
>> +
>> +        // SAFETY: `tagset.raw_tag_set()` points to a valid and initialized tag set
>> +        let gendisk = from_err_ptr(unsafe {
>> +            bindings::__blk_mq_alloc_disk(tagset.raw_tag_set(), data as _, lock_class_key.as_ptr())
>
> Avoid `as _` casts.

👍

>
>> +        })?;
>> +
>> +        const TABLE: bindings::block_device_operations = bindings::block_device_operations {
>> +            submit_bio: None,
>> +            open: None,
>> +            release: None,
>> +            ioctl: None,
>> +            compat_ioctl: None,
>> +            check_events: None,
>> +            unlock_native_capacity: None,
>> +            getgeo: None,
>> +            set_read_only: None,
>> +            swap_slot_free_notify: None,
>> +            report_zones: None,
>> +            devnode: None,
>> +            alternative_gpt_sector: None,
>> +            get_unique_id: None,
>> +            owner: core::ptr::null_mut(),
>> +            pr_ops: core::ptr::null_mut(),
>> +            free_disk: None,
>> +            poll_bio: None,
>> +        };
>> +
>> +        // SAFETY: gendisk is a valid pointer as we initialized it above
>> +        unsafe { (*gendisk).fops = &TABLE };
>> +
>> +        recover_data.dismiss();
>> +        Ok(Self {
>> +            _tagset: tagset,
>> +            gendisk,
>> +        })
>> +    }
>> +
>> +    /// Set the name of the device
>> +    pub fn set_name(&self, args: fmt::Arguments<'_>) -> Result {
>> +        let mut raw_writer = RawWriter::from_array(unsafe { &mut (*self.gendisk).disk_name });
>
> Missing `SAFETY` also see below.

Yes, I have a few of those. Will add for next version.

>
>> +        raw_writer.write_fmt(args)?;
>> +        raw_writer.write_char('\0')?;
>> +        Ok(())
>> +    }
>> +
>> +    /// Register the device with the kernel. When this function return, the
>> +    /// device is accessible from VFS. The kernel may issue reads to the device
>> +    /// during registration to discover partition infomation.
>> +    pub fn add(&self) -> Result {
>> +        crate::error::to_result(unsafe {
>> +            bindings::device_add_disk(core::ptr::null_mut(), self.gendisk, core::ptr::null_mut())
>> +        })
>> +    }
>> +
>> +    /// Call to tell the block layer the capcacity of the device
>> +    pub fn set_capacity(&self, sectors: u64) {
>> +        unsafe { bindings::set_capacity(self.gendisk, sectors) };
>> +    }
>> +
>> +    /// Set the logical block size of the device
>> +    pub fn set_queue_logical_block_size(&self, size: u32) {
>> +        unsafe { bindings::blk_queue_logical_block_size((*self.gendisk).queue, size) };
>> +    }
>> +
>> +    /// Set the physical block size of the device
>
> What does this *do*? I do not think the doc string gives any meaningful
> information not present in the function name (this might just be,
> because I have no idea of what this is and anyone with just a little
> more knowledge would know, but I still wanted to mention it).

I'll add some more context.

>
>> +    pub fn set_queue_physical_block_size(&self, size: u32) {
>> +        unsafe { bindings::blk_queue_physical_block_size((*self.gendisk).queue, size) };
>> +    }
>> +
>> +    /// Set the rotational media attribute for the device
>> +    pub fn set_rotational(&self, rotational: bool) {
>> +        if !rotational {
>> +            unsafe {
>> +                bindings::blk_queue_flag_set(bindings::QUEUE_FLAG_NONROT, (*self.gendisk).queue)
>> +            };
>> +        } else {
>> +            unsafe {
>> +                bindings::blk_queue_flag_clear(bindings::QUEUE_FLAG_NONROT, (*self.gendisk).queue)
>> +            };
>> +        }
>> +    }
>> +}
>> +
>> +impl<T: Operations> Drop for GenDisk<T> {
>> +    fn drop(&mut self) {
>> +        let queue_data = unsafe { (*(*self.gendisk).queue).queuedata };
>> +
>> +        unsafe { bindings::del_gendisk(self.gendisk) };
>> +
>> +        // SAFETY: `queue.queuedata` was created by `GenDisk::try_new()` with a
>> +        // call to `ForeignOwnable::into_pointer()` to create `queuedata`.
>> +        // `ForeignOwnable::from_foreign()` is only called here.
>> +        let _queue_data = unsafe { T::QueueData::from_foreign(queue_data) };
>> +    }
>> +}
>> diff --git a/rust/kernel/block/mq/operations.rs b/rust/kernel/block/mq/operations.rs
>> new file mode 100644
>> index 000000000000..fb1ab707d1f0
>> --- /dev/null
>> +++ b/rust/kernel/block/mq/operations.rs
>> @@ -0,0 +1,260 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +
>> +//! This module provides an interface for blk-mq drivers to implement.
>> +//!
>> +//! C header: [`include/linux/blk-mq.h`](../../include/linux/blk-mq.h)
>> +
>> +use crate::{
>> +    bindings,
>> +    block::mq::{tag_set::TagSetRef, Request},
>> +    error::{from_result, Result},
>> +    types::ForeignOwnable,
>> +};
>> +use core::{marker::PhantomData, pin::Pin};
>> +
>> +/// Implement this trait to interface blk-mq as block devices
>> +#[macros::vtable]
>> +pub trait Operations: Sized {
>
> Is this trait really safe? Are there **no** requirements for e.g.
> `QueueData`? So could I use `Box<()>`?

Yes, it is intended to be safe. `ForeignOwnable` covers safety
requirements for these associated data types.

>
>> +    /// Data associated with a request. This data is located next to the request
>> +    /// structure.
>> +    type RequestData;
>> +
>> +    /// Data associated with the `struct request_queue` that is allocated for
>> +    /// the `GenDisk` associated with this `Operations` implementation.
>> +    type QueueData: ForeignOwnable;
>> +
>> +    /// Data associated with a dispatch queue. This is stored as a pointer in
>> +    /// `struct blk_mq_hw_ctx`.
>> +    type HwData: ForeignOwnable;
>> +
>> +    /// Data associated with a tag set. This is stored as a pointer in `struct
>> +    /// blk_mq_tag_set`.
>> +    type TagSetData: ForeignOwnable;
>> +
>> +    /// Called by the kernel to allocate a new `RequestData`. The structure will
>> +    /// eventually be pinned, so defer initialization to `init_request_data()`
>> +    fn new_request_data(
>> +        _tagset_data: <Self::TagSetData as ForeignOwnable>::Borrowed<'_>,
>> +    ) -> Result<Self::RequestData>;
>> +
>> +    /// Called by the kernel to initialize a previously allocated `RequestData`
>> +    fn init_request_data(
>> +        _tagset_data: <Self::TagSetData as ForeignOwnable>::Borrowed<'_>,
>> +        _data: Pin<&mut Self::RequestData>,
>> +    ) -> Result {
>> +        Ok(())
>> +    }
>> +
>> +    /// Called by the kernel to queue a request with the driver. If `is_last` is
>> +    /// `false`, the driver is allowed to defer commiting the request.
>> +    fn queue_rq(
>> +        hw_data: <Self::HwData as ForeignOwnable>::Borrowed<'_>,
>> +        queue_data: <Self::QueueData as ForeignOwnable>::Borrowed<'_>,
>> +        rq: &Request<Self>,
>> +        is_last: bool,
>> +    ) -> Result;
>> +
>> +    /// Called by the kernel to indicate that queued requests should be submitted
>> +    fn commit_rqs(
>> +        hw_data: <Self::HwData as ForeignOwnable>::Borrowed<'_>,
>> +        queue_data: <Self::QueueData as ForeignOwnable>::Borrowed<'_>,
>> +    );
>> +
>> +    /// Called by the kernel when the request is completed
>> +    fn complete(_rq: &Request<Self>);
>> +
>> +    /// Called by the kernel to allocate and initialize a driver specific hardware context data
>> +    fn init_hctx(
>> +        tagset_data: <Self::TagSetData as ForeignOwnable>::Borrowed<'_>,
>> +        hctx_idx: u32,
>> +    ) -> Result<Self::HwData>;
>> +
>> +    /// Called by the kernel to poll the device for completed requests. Only used for poll queues.
>> +    fn poll(_hw_data: <Self::HwData as ForeignOwnable>::Borrowed<'_>) -> i32 {
>> +        unreachable!()
>
> Why are these implemented this way? Should this really panic? Maybe
> return an error? Why `i32` as the return type? If it can error it should
> be `Result<u32>`.

I will update in accordance with the new documentation for `#[vtable]`.

Return type should be `bool`, I will change it. It inherited the int
from `core::ffi::c_int`.

>
>> +    }
>> +
>> +    /// Called by the kernel to map submission queues to CPU cores.
>> +    fn map_queues(_tag_set: &TagSetRef) {
>> +        unreachable!()
>> +    }
>> +
>> +    // There is no need for exit_request() because `drop` will be called.
>> +}
>> +
>> +pub(crate) struct OperationsVtable<T: Operations>(PhantomData<T>);
>> +
>> +impl<T: Operations> OperationsVtable<T> {
>> +    // # Safety
>> +    //
>> +    // The caller of this function must ensure that `hctx` and `bd` are valid
>> +    // and initialized. The pointees must outlive this function. Further
>> +    // `hctx->driver_data` must be a pointer created by a call to
>> +    // `Self::init_hctx_callback()` and the pointee must outlive this function.
>> +    // This function must not be called with a `hctx` for which
>> +    // `Self::exit_hctx_callback()` has been called.
>> +    unsafe extern "C" fn queue_rq_callback(
>> +        hctx: *mut bindings::blk_mq_hw_ctx,
>> +        bd: *const bindings::blk_mq_queue_data,
>> +    ) -> bindings::blk_status_t {
>> +        // SAFETY: `bd` is valid as required by the safety requirement for this function.
>> +        let rq = unsafe { (*bd).rq };
>> +
>> +        // SAFETY: The safety requirement for this function ensure that
>> +        // `(*hctx).driver_data` was returned by a call to
>> +        // `Self::init_hctx_callback()`. That function uses
>> +        // `PointerWrapper::into_pointer()` to create `driver_data`. Further,
>> +        // the returned value does not outlive this function and
>> +        // `from_foreign()` is not called until `Self::exit_hctx_callback()` is
>> +        // called. By the safety requirement of this function and contract with
>> +        // the `blk-mq` API, `queue_rq_callback()` will not be called after that
>> +        // point.
>
> This safety section and the others here are rather long and mostly
> repeat themselves. Is it possible to put this in its own module and
> explain the safety invariants in that module and then in these safety
> sections just refer to some labels from that section?
>
> I think we should discuss this in our next meeting.

Not sure about the best way to do this. Lets talk.

<...>

>> +
>> +    pub(crate) const unsafe fn build() -> &'static bindings::blk_mq_ops {
>> +        &Self::VTABLE
>> +    }
>
> Why is this function `unsafe`?

I don't think it needs to be unsafe, thanks.

>
>> +}
>
> Some `# Safety` and `SAFETY` missing in this hunk.
>
>> diff --git a/rust/kernel/block/mq/raw_writer.rs b/rust/kernel/block/mq/raw_writer.rs
>> new file mode 100644
>> index 000000000000..25c16ee0b1f7
>> --- /dev/null
>> +++ b/rust/kernel/block/mq/raw_writer.rs
>> @@ -0,0 +1,30 @@
>> +use core::fmt::{self, Write};
>> +
>> +pub(crate) struct RawWriter {
>> +    ptr: *mut u8,
>> +    len: usize,
>> +}
>> +
>> +impl RawWriter {
>> +    unsafe fn new(ptr: *mut u8, len: usize) -> Self {
>> +        Self { ptr, len }
>> +    }
>> +
>> +    pub(crate) fn from_array<const N: usize>(a: &mut [core::ffi::c_char; N]) -> Self {
>> +        unsafe { Self::new(&mut a[0] as *mut _ as _, N) }
>> +    }
>
> This function needs to be `unsafe`, because it never captures the
> lifetime of `a`. I can write:
>     let mut a = Box::new([0; 10]);
>     let mut writer = RawWriter::from_array(&mut *a);
>     drop(a);
>     writer.write_str("Abc"); // UAF
> Alternatively add a lifetime to `RawWriter`.

Yes, a lifetime is missing in RawWriter, thanks.

>
>> +}
>> +
>> +impl Write for RawWriter {
>> +    fn write_str(&mut self, s: &str) -> fmt::Result {
>> +        let bytes = s.as_bytes();
>> +        let len = bytes.len();
>> +        if len > self.len {
>> +            return Err(fmt::Error);
>> +        }
>> +        unsafe { core::ptr::copy_nonoverlapping(&bytes[0], self.ptr, len) };
>> +        self.ptr = unsafe { self.ptr.add(len) };
>> +        self.len -= len;
>> +        Ok(())
>> +    }
>> +}
>> diff --git a/rust/kernel/block/mq/request.rs b/rust/kernel/block/mq/request.rs
>> new file mode 100644
>> index 000000000000..e95ae3fd71ad
>> --- /dev/null
>> +++ b/rust/kernel/block/mq/request.rs
>> @@ -0,0 +1,71 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +
>> +//! This module provides a wrapper for the C `struct request` type.
>> +//!
>> +//! C header: [`include/linux/blk-mq.h`](../../include/linux/blk-mq.h)
>> +
>> +use crate::{
>> +    bindings,
>> +    block::mq::Operations,
>> +    error::{Error, Result},
>> +};
>> +use core::marker::PhantomData;
>> +
>> +/// A wrapper around a blk-mq `struct request`. This represents an IO request.
>> +pub struct Request<T: Operations> {
>> +    ptr: *mut bindings::request,
>
> Why is this not embedded?

I have changed it to `struct Request(Opaque<bindings::request>)` for
next version 👍

>
>> +    _p: PhantomData<T>,
>> +}
>> +
>> +impl<T: Operations> Request<T> {
>> +    pub(crate) unsafe fn from_ptr(ptr: *mut bindings::request) -> Self {
>> +        Self {
>> +            ptr,
>> +            _p: PhantomData,
>> +        }
>> +    }
>> +
>> +    /// Get the command identifier for the request
>> +    pub fn command(&self) -> u32 {
>> +        unsafe { (*self.ptr).cmd_flags & ((1 << bindings::REQ_OP_BITS) - 1) }
>> +    }
>> +
>> +    /// Call this to indicate to the kernel that the request has been issued by the driver
>> +    pub fn start(&self) {
>> +        unsafe { bindings::blk_mq_start_request(self.ptr) };
>> +    }
>> +
>> +    /// Call this to indicate to the kernel that the request has been completed without errors
>> +    // TODO: Consume rq so that we can't use it after ending it?
>> +    pub fn end_ok(&self) {
>> +        unsafe { bindings::blk_mq_end_request(self.ptr, bindings::BLK_STS_OK as _) };
>> +    }
>> +
>> +    /// Call this to indicate to the kernel that the request completed with an error
>> +    pub fn end_err(&self, err: Error) {
>> +        unsafe { bindings::blk_mq_end_request(self.ptr, err.to_blk_status()) };
>> +    }
>> +
>> +    /// Call this to indicate that the request completed with the status indicated by `status`
>> +    pub fn end(&self, status: Result) {
>> +        if let Err(e) = status {
>> +            self.end_err(e);
>> +        } else {
>> +            self.end_ok();
>> +        }
>> +    }
>> +
>> +    /// Call this to schedule defered completion of the request
>> +    // TODO: Consume rq so that we can't use it after completing it?
>> +    pub fn complete(&self) {
>> +        if !unsafe { bindings::blk_mq_complete_request_remote(self.ptr) } {
>> +            T::complete(&unsafe { Self::from_ptr(self.ptr) });
>> +        }
>> +    }
>> +
>> +    /// Get the target sector for the request
>> +    #[inline(always)]
>
> Why is this `inline(always)`?

Compiler would not inline from kernel crate to modules without this. I
will check if this is still the case.

>
>> +    pub fn sector(&self) -> usize {
>> +        unsafe { (*self.ptr).__sector as usize }
>> +    }
>> +}
>> diff --git a/rust/kernel/block/mq/tag_set.rs b/rust/kernel/block/mq/tag_set.rs
>> new file mode 100644
>> index 000000000000..d122db7f6d0e
>> --- /dev/null
>> +++ b/rust/kernel/block/mq/tag_set.rs
>> @@ -0,0 +1,92 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +
>> +//! This module provides the `TagSet` struct to wrap the C `struct blk_mq_tag_set`.
>> +//!
>> +//! C header: [`include/linux/blk-mq.h`](../../include/linux/blk-mq.h)
>> +
>> +use crate::{
>> +    bindings,
>> +    block::mq::{operations::OperationsVtable, Operations},
>> +    error::{Error, Result},
>> +    sync::Arc,
>> +    types::ForeignOwnable,
>> +};
>> +use core::{cell::UnsafeCell, convert::TryInto, marker::PhantomData};
>> +
>> +/// A wrapper for the C `struct blk_mq_tag_set`
>> +pub struct TagSet<T: Operations> {
>> +    inner: UnsafeCell<bindings::blk_mq_tag_set>,
>> +    _p: PhantomData<T>,
>> +}
>> +
>> +impl<T: Operations> TagSet<T> {
>> +    /// Try to create a new tag set
>> +    pub fn try_new(
>> +        nr_hw_queues: u32,
>> +        tagset_data: T::TagSetData,
>> +        num_tags: u32,
>> +        num_maps: u32,
>> +    ) -> Result<Arc<Self>> {
>
> Why force the users to use `Arc`?

Changed to return a `PinInit<TagSet>` for next version.

>
>> +        let tagset = Arc::try_new(Self {
>> +            inner: UnsafeCell::new(bindings::blk_mq_tag_set::default()),
>> +            _p: PhantomData,
>> +        })?;
>> +
>> +        // SAFETY: We just allocated `tagset`, we know this is the only reference to it.
>> +        let inner = unsafe { &mut *tagset.inner.get() };
>> +
>> +        inner.ops = unsafe { OperationsVtable::<T>::build() };
>> +        inner.nr_hw_queues = nr_hw_queues;
>> +        inner.timeout = 0; // 0 means default which is 30 * HZ in C
>> +        inner.numa_node = bindings::NUMA_NO_NODE;
>> +        inner.queue_depth = num_tags;
>> +        inner.cmd_size = core::mem::size_of::<T::RequestData>().try_into()?;
>> +        inner.flags = bindings::BLK_MQ_F_SHOULD_MERGE;
>> +        inner.driver_data = tagset_data.into_foreign() as _;
>> +        inner.nr_maps = num_maps;
>> +
>> +        // SAFETY: `inner` points to valid and initialised memory.
>> +        let ret = unsafe { bindings::blk_mq_alloc_tag_set(inner) };
>> +        if ret < 0 {
>> +            // SAFETY: We created `driver_data` above with `into_foreign`
>> +            unsafe { T::TagSetData::from_foreign(inner.driver_data) };
>> +            return Err(Error::from_errno(ret));
>> +        }
>> +
>> +        Ok(tagset)
>> +    }
>> +
>> +    /// Return the pointer to the wrapped `struct blk_mq_tag_set`
>> +    pub(crate) fn raw_tag_set(&self) -> *mut bindings::blk_mq_tag_set {
>> +        self.inner.get()
>> +    }
>> +}
>> +
>> +impl<T: Operations> Drop for TagSet<T> {
>> +    fn drop(&mut self) {
>> +        let tagset_data = unsafe { (*self.inner.get()).driver_data };
>> +
>> +        // SAFETY: `inner` is valid and has been properly initialised during construction.
>> +        unsafe { bindings::blk_mq_free_tag_set(self.inner.get()) };
>> +
>> +        // SAFETY: `tagset_data` was created by a call to
>> +        // `ForeignOwnable::into_foreign` in `TagSet::try_new()`
>> +        unsafe { T::TagSetData::from_foreign(tagset_data) };
>> +    }
>> +}
>> +
>> +/// A tag set reference. Used to control lifetime and prevent drop of TagSet references passed to
>> +/// `Operations::map_queues()`
>> +pub struct TagSetRef {
>> +    ptr: *mut bindings::blk_mq_tag_set,
>> +}
>> +
>> +impl TagSetRef {
>> +    pub(crate) unsafe fn from_ptr(tagset: *mut bindings::blk_mq_tag_set) -> Self {
>> +        Self { ptr: tagset }
>> +    }
>> +
>> +    pub fn ptr(&self) -> *mut bindings::blk_mq_tag_set {
>> +        self.ptr
>> +    }
>> +}
>
> This is a **very** thin abstraction, why is it needed?

It is not. I changed it to `&TagSet`, thanks.


Thanks for the comments!

Best regards,
Andreas

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC PATCH 03/11] rust: block: introduce `kernel::block::mq` module
  2023-05-11  6:52     ` Sergio González Collado
@ 2024-01-23 14:03       ` Andreas Hindborg (Samsung)
  0 siblings, 0 replies; 69+ messages in thread
From: Andreas Hindborg (Samsung) @ 2024-01-23 14:03 UTC (permalink / raw)
  To: Sergio González Collado
  Cc: Benno Lossin, Jens Axboe, Christoph Hellwig, Keith Busch,
	Damien Le Moal, Hannes Reinecke, lsf-pc, rust-for-linux,
	linux-block, Matthew Wilcox, Miguel Ojeda, Alex Gaynor,
	Wedson Almeida Filho, Boqun Feng, Gary Guo, Björn Roy Baron,
	linux-kernel, gost.dev


Sergio González Collado <sergio.collado@gmail.com> writes:

> +    /// Call to tell the block layer the capcacity of the device
> +    pub fn set_capacity(&self, sectors: u64) {
> +        unsafe { bindings::set_capacity(self.gendisk, sectors) };
> +    }
>
> Nit in the comment: capcacity -> capacity

Thanks!

BR Andreas


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC PATCH 03/11] rust: block: introduce `kernel::block::mq` module
  2024-01-12  9:18     ` Andreas Hindborg (Samsung)
@ 2024-01-23 16:14       ` Benno Lossin
  2024-01-23 18:39         ` Andreas Hindborg (Samsung)
  0 siblings, 1 reply; 69+ messages in thread
From: Benno Lossin @ 2024-01-23 16:14 UTC (permalink / raw)
  To: Andreas Hindborg (Samsung)
  Cc: Jens Axboe, Christoph Hellwig, Keith Busch, Damien Le Moal,
	Hannes Reinecke, lsf-pc, rust-for-linux, linux-block,
	Matthew Wilcox, Miguel Ojeda, Alex Gaynor, Wedson Almeida Filho,
	Boqun Feng, Gary Guo, Björn Roy Baron, linux-kernel,
	gost.dev

Hi Andreas,

just so you know, I received this email today, so it was very late,
since the send date is January 12.

On 12.01.24 10:18, Andreas Hindborg (Samsung) wrote:
> 
> Hi Benno,
> 
> Benno Lossin <benno.lossin@proton.me> writes:
> 
> <...>
> 
>>> diff --git a/rust/kernel/block/mq/gen_disk.rs b/rust/kernel/block/mq/gen_disk.rs
>>> new file mode 100644
>>> index 000000000000..50496af15bbf
>>> --- /dev/null
>>> +++ b/rust/kernel/block/mq/gen_disk.rs
>>> @@ -0,0 +1,133 @@
>>> +// SPDX-License-Identifier: GPL-2.0
>>> +
>>> +//! GenDisk abstraction
>>> +//!
>>> +//! C header: [`include/linux/blkdev.h`](../../include/linux/blkdev.h)
>>> +//! C header: [`include/linux/blk_mq.h`](../../include/linux/blk_mq.h)
>>> +
>>> +use crate::block::mq::{raw_writer::RawWriter, Operations, TagSet};
>>> +use crate::{
>>> +    bindings, error::from_err_ptr, error::Result, sync::Arc, types::ForeignOwnable,
>>> +    types::ScopeGuard,
>>> +};
>>> +use core::fmt::{self, Write};
>>> +
>>> +/// A generic block device
>>> +///
>>> +/// # Invariants
>>> +///
>>> +///  - `gendisk` must always point to an initialized and valid `struct gendisk`.
>>> +pub struct GenDisk<T: Operations> {
>>> +    _tagset: Arc<TagSet<T>>,
>>> +    gendisk: *mut bindings::gendisk,
>>
>> Why are these two fields not embedded? Shouldn't the user decide where
>> to allocate?
> 
> The `TagSet` can be shared between multiple `GenDisk`. Using an `Arc`
> seems resonable?
> 
> For the `gendisk` field, the allocation is done by C and the address
> must be stable. We are owning the pointee and must drop it when it goes out
> of scope. I could do this:
> 
> #[repr(transparent)]
> struct GenDisk(Opaque<bindings::gendisk>);
> 
> struct UniqueGenDiskRef {
>      _tagset: Arc<TagSet<T>>,
>      gendisk: Pin<&'static mut GenDisk>,
> 
> }
> 
> but it seems pointless. `struct GenDisk` would not be pub in that case. What do you think?

Hmm, I am a bit confused as to how you usually use a `struct gendisk`.
You said that a `TagSet` might be shared between multiple `GenDisk`s,
but that is not facilitated by the C side?

Is it the case that on the C side you create a struct containing a
tagset and a gendisk for every block device you want to represent?
And you decided for the Rust abstractions that you want to have only a
single generic struct for any block device, distinguished by the generic
parameter?
I think these kinds of details would be nice to know. Not only for
reviewers, but also for veterans of the C APIs.

-- 
Cheers,
Benno


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC PATCH 03/11] rust: block: introduce `kernel::block::mq` module
  2024-01-23 16:14       ` Benno Lossin
@ 2024-01-23 18:39         ` Andreas Hindborg (Samsung)
  2024-01-25  9:26           ` Benno Lossin
  0 siblings, 1 reply; 69+ messages in thread
From: Andreas Hindborg (Samsung) @ 2024-01-23 18:39 UTC (permalink / raw)
  To: Benno Lossin
  Cc: Jens Axboe, Christoph Hellwig, Keith Busch, Damien Le Moal,
	Hannes Reinecke, lsf-pc, rust-for-linux, linux-block,
	Matthew Wilcox, Miguel Ojeda, Alex Gaynor, Wedson Almeida Filho,
	Boqun Feng, Gary Guo, Björn Roy Baron, linux-kernel,
	gost.dev


Benno Lossin <benno.lossin@proton.me> writes:

> Hi Andreas,
>
> just so you know, I received this email today, so it was very late,
> since the send date is January 12.

My mistake. I started drafting Jan 12, but did not get time to finish
the mail until today. I guess that is how mu4e does things, I should be
aware and fix up the date. Thanks for letting me know 👍

>
> On 12.01.24 10:18, Andreas Hindborg (Samsung) wrote:
>> 
>> Hi Benno,
>> 
>> Benno Lossin <benno.lossin@proton.me> writes:
>> 
>> <...>
>> 
>>>> diff --git a/rust/kernel/block/mq/gen_disk.rs b/rust/kernel/block/mq/gen_disk.rs
>>>> new file mode 100644
>>>> index 000000000000..50496af15bbf
>>>> --- /dev/null
>>>> +++ b/rust/kernel/block/mq/gen_disk.rs
>>>> @@ -0,0 +1,133 @@
>>>> +// SPDX-License-Identifier: GPL-2.0
>>>> +
>>>> +//! GenDisk abstraction
>>>> +//!
>>>> +//! C header: [`include/linux/blkdev.h`](../../include/linux/blkdev.h)
>>>> +//! C header: [`include/linux/blk_mq.h`](../../include/linux/blk_mq.h)
>>>> +
>>>> +use crate::block::mq::{raw_writer::RawWriter, Operations, TagSet};
>>>> +use crate::{
>>>> +    bindings, error::from_err_ptr, error::Result, sync::Arc, types::ForeignOwnable,
>>>> +    types::ScopeGuard,
>>>> +};
>>>> +use core::fmt::{self, Write};
>>>> +
>>>> +/// A generic block device
>>>> +///
>>>> +/// # Invariants
>>>> +///
>>>> +///  - `gendisk` must always point to an initialized and valid `struct gendisk`.
>>>> +pub struct GenDisk<T: Operations> {
>>>> +    _tagset: Arc<TagSet<T>>,
>>>> +    gendisk: *mut bindings::gendisk,
>>>
>>> Why are these two fields not embedded? Shouldn't the user decide where
>>> to allocate?
>> 
>> The `TagSet` can be shared between multiple `GenDisk`. Using an `Arc`
>> seems resonable?
>> 
>> For the `gendisk` field, the allocation is done by C and the address
>> must be stable. We are owning the pointee and must drop it when it goes out
>> of scope. I could do this:
>> 
>> #[repr(transparent)]
>> struct GenDisk(Opaque<bindings::gendisk>);
>> 
>> struct UniqueGenDiskRef {
>>      _tagset: Arc<TagSet<T>>,
>>      gendisk: Pin<&'static mut GenDisk>,
>> 
>> }
>> 
>> but it seems pointless. `struct GenDisk` would not be pub in that case. What do you think?
>
> Hmm, I am a bit confused as to how you usually use a `struct gendisk`.
> You said that a `TagSet` might be shared between multiple `GenDisk`s,
> but that is not facilitated by the C side?
>
> Is it the case that on the C side you create a struct containing a
> tagset and a gendisk for every block device you want to represent?

Yes, but the `struct tag_set` can be shared between multiple `struct
gendisk`.

Let me try to elaborate:

In C you would first allocate a `struct tag_set` and partially
initialize it. The allocation can be dynamic, static or part of existing
allocation. You would then partially initialize the structure and finish
the initialization by calling `blk_mq_alloc_tag_set()`. This populates
the rest of the structure which includes more dynamic allocations.

You then allocate a `struct gendisk` by calling `blk_mq_alloc_disk()`,
passing in a pointer to the `struct tag_set` you just created. This
function will return a pointer to a `struct gendisk` on success.

In the Rust abstractions, we allocate the `TagSet`:

#[pin_data(PinnedDrop)]
#[repr(transparent)]
pub struct TagSet<T: Operations> {
    #[pin]
    inner: Opaque<bindings::blk_mq_tag_set>,
    _p: PhantomData<T>,
}

with `PinInit` [^1]. The initializer will partially initialize the struct and
finish the initialization like C does by calling
`blk_mq_alloc_tag_set()`. We now need a place to point the initializer.
`Arc::pin_init()` is that place for now. It allows us to pass the
`TagSet` reference to multiple `GenDisk` if required. Maybe we could be
generic over `Deref<TagSet>` in the future. Bottom line is that we need
to hold on to that `TagSet` reference until the `GenDisk` is dropped.

`struct tag_set` is not reference counted on the C side. C
implementations just take care to keep it alive, for instance by storing
it next to a pointer to `struct gendisk` that it is servicing.

> And you decided for the Rust abstractions that you want to have only a
> single generic struct for any block device, distinguished by the generic
> parameter?

Yes, we have a single generic struct (`GenDisk`) representing the C
`struct gendisk`, and a single generic struct (`TagSet`) representing
the C `struct tag_set`. These are both generic over `T: Operations`.
`Operations` represent a C vtable (`struct blk_mq_ops`) attached to the
`struct tag_set`. This vtable is provided by the driver and holds
function pointers that allow the kernel to perform actions such as queue
IO requests with the driver. A C driver can instantiate multiple `struct
gendisk` and service them with the same `struct tag_set` and thereby the
same vtable. Or it can use separate tag sets and the same vtable. Or a
separate tag_set and vtable for each gendisk.

> I think these kinds of details would be nice to know. Not only for
> reviewers, but also for veterans of the C APIs.

I should write some module level documentation clarifying the use of
these types. The null block driver is a simple example, but it is just
code. I will include more docs in the next version.

Best regards
Andreas


[^1]: This was not `PinInit` in the RFC, I changed this based on your
feedback. The main points are still the same though.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC PATCH 03/11] rust: block: introduce `kernel::block::mq` module
  2024-01-23 18:39         ` Andreas Hindborg (Samsung)
@ 2024-01-25  9:26           ` Benno Lossin
  2024-01-29 14:14             ` Andreas Hindborg (Samsung)
  0 siblings, 1 reply; 69+ messages in thread
From: Benno Lossin @ 2024-01-25  9:26 UTC (permalink / raw)
  To: Andreas Hindborg (Samsung)
  Cc: Jens Axboe, Christoph Hellwig, Keith Busch, Damien Le Moal,
	Hannes Reinecke, lsf-pc, rust-for-linux, linux-block,
	Matthew Wilcox, Miguel Ojeda, Alex Gaynor, Wedson Almeida Filho,
	Boqun Feng, Gary Guo, Björn Roy Baron, linux-kernel,
	gost.dev

On 23.01.24 19:39, Andreas Hindborg (Samsung) wrote:
>>>>> +/// A generic block device
>>>>> +///
>>>>> +/// # Invariants
>>>>> +///
>>>>> +///  - `gendisk` must always point to an initialized and valid `struct gendisk`.
>>>>> +pub struct GenDisk<T: Operations> {
>>>>> +    _tagset: Arc<TagSet<T>>,
>>>>> +    gendisk: *mut bindings::gendisk,
>>>>
>>>> Why are these two fields not embedded? Shouldn't the user decide where
>>>> to allocate?
>>>
>>> The `TagSet` can be shared between multiple `GenDisk`. Using an `Arc`
>>> seems resonable?
>>>
>>> For the `gendisk` field, the allocation is done by C and the address
>>> must be stable. We are owning the pointee and must drop it when it goes out
>>> of scope. I could do this:
>>>
>>> #[repr(transparent)]
>>> struct GenDisk(Opaque<bindings::gendisk>);
>>>
>>> struct UniqueGenDiskRef {
>>>       _tagset: Arc<TagSet<T>>,
>>>       gendisk: Pin<&'static mut GenDisk>,
>>>
>>> }
>>>
>>> but it seems pointless. `struct GenDisk` would not be pub in that case. What do you think?
>>
>> Hmm, I am a bit confused as to how you usually use a `struct gendisk`.
>> You said that a `TagSet` might be shared between multiple `GenDisk`s,
>> but that is not facilitated by the C side?
>>
>> Is it the case that on the C side you create a struct containing a
>> tagset and a gendisk for every block device you want to represent?
> 
> Yes, but the `struct tag_set` can be shared between multiple `struct
> gendisk`.
> 
> Let me try to elaborate:
> 
> In C you would first allocate a `struct tag_set` and partially
> initialize it. The allocation can be dynamic, static or part of existing
> allocation. You would then partially initialize the structure and finish
> the initialization by calling `blk_mq_alloc_tag_set()`. This populates
> the rest of the structure which includes more dynamic allocations.
> 
> You then allocate a `struct gendisk` by calling `blk_mq_alloc_disk()`,
> passing in a pointer to the `struct tag_set` you just created. This
> function will return a pointer to a `struct gendisk` on success.
> 
> In the Rust abstractions, we allocate the `TagSet`:
> 
> #[pin_data(PinnedDrop)]
> #[repr(transparent)]
> pub struct TagSet<T: Operations> {
>      #[pin]
>      inner: Opaque<bindings::blk_mq_tag_set>,
>      _p: PhantomData<T>,
> }
> 
> with `PinInit` [^1]. The initializer will partially initialize the struct and
> finish the initialization like C does by calling
> `blk_mq_alloc_tag_set()`. We now need a place to point the initializer.
> `Arc::pin_init()` is that place for now. It allows us to pass the
> `TagSet` reference to multiple `GenDisk` if required. Maybe we could be
> generic over `Deref<TagSet>` in the future. Bottom line is that we need
> to hold on to that `TagSet` reference until the `GenDisk` is dropped.

I see, thanks for the elaborate explanation! I now think that using `Arc`
makes sense.

> `struct tag_set` is not reference counted on the C side. C
> implementations just take care to keep it alive, for instance by storing
> it next to a pointer to `struct gendisk` that it is servicing.

This is interesting, is this also done in the case where it is shared
among multiple `struct gendisk`s?
Does this have some deeper reason? Or am I right to assume that creating
`Gendisk`/`TagSet` is done rarely (i.e. only at initialization of the
driver)?

>> And you decided for the Rust abstractions that you want to have only a
>> single generic struct for any block device, distinguished by the generic
>> parameter?
> 
> Yes, we have a single generic struct (`GenDisk`) representing the C
> `struct gendisk`, and a single generic struct (`TagSet`) representing
> the C `struct tag_set`. These are both generic over `T: Operations`.
> `Operations` represent a C vtable (`struct blk_mq_ops`) attached to the
> `struct tag_set`. This vtable is provided by the driver and holds
> function pointers that allow the kernel to perform actions such as queue
> IO requests with the driver. A C driver can instantiate multiple `struct
> gendisk` and service them with the same `struct tag_set` and thereby the
> same vtable. Or it can use separate tag sets and the same vtable. Or a
> separate tag_set and vtable for each gendisk.
> 
>> I think these kinds of details would be nice to know. Not only for
>> reviewers, but also for veterans of the C APIs.
> 
> I should write some module level documentation clarifying the use of
> these types. The null block driver is a simple example, but it is just
> code. I will include more docs in the next version.

Thanks a lot for explaining!

-- 
Cheers,
Benno



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC PATCH 03/11] rust: block: introduce `kernel::block::mq` module
  2024-01-25  9:26           ` Benno Lossin
@ 2024-01-29 14:14             ` Andreas Hindborg (Samsung)
  0 siblings, 0 replies; 69+ messages in thread
From: Andreas Hindborg (Samsung) @ 2024-01-29 14:14 UTC (permalink / raw)
  To: Benno Lossin
  Cc: Jens Axboe, Christoph Hellwig, Keith Busch, Damien Le Moal,
	Hannes Reinecke, lsf-pc, rust-for-linux, linux-block,
	Matthew Wilcox, Miguel Ojeda, Alex Gaynor, Wedson Almeida Filho,
	Boqun Feng, Gary Guo, Björn Roy Baron, linux-kernel,
	gost.dev


Benno Lossin <benno.lossin@proton.me> writes:

> On 23.01.24 19:39, Andreas Hindborg (Samsung) wrote:
>>>>>> +/// A generic block device
>>>>>> +///
>>>>>> +/// # Invariants
>>>>>> +///
>>>>>> +///  - `gendisk` must always point to an initialized and valid `struct gendisk`.
>>>>>> +pub struct GenDisk<T: Operations> {
>>>>>> +    _tagset: Arc<TagSet<T>>,
>>>>>> +    gendisk: *mut bindings::gendisk,
>>>>>
>>>>> Why are these two fields not embedded? Shouldn't the user decide where
>>>>> to allocate?
>>>>
>>>> The `TagSet` can be shared between multiple `GenDisk`. Using an `Arc`
>>>> seems resonable?
>>>>
>>>> For the `gendisk` field, the allocation is done by C and the address
>>>> must be stable. We are owning the pointee and must drop it when it goes out
>>>> of scope. I could do this:
>>>>
>>>> #[repr(transparent)]
>>>> struct GenDisk(Opaque<bindings::gendisk>);
>>>>
>>>> struct UniqueGenDiskRef {
>>>>       _tagset: Arc<TagSet<T>>,
>>>>       gendisk: Pin<&'static mut GenDisk>,
>>>>
>>>> }
>>>>
>>>> but it seems pointless. `struct GenDisk` would not be pub in that case. What do you think?
>>>
>>> Hmm, I am a bit confused as to how you usually use a `struct gendisk`.
>>> You said that a `TagSet` might be shared between multiple `GenDisk`s,
>>> but that is not facilitated by the C side?
>>>
>>> Is it the case that on the C side you create a struct containing a
>>> tagset and a gendisk for every block device you want to represent?
>> 
>> Yes, but the `struct tag_set` can be shared between multiple `struct
>> gendisk`.
>> 
>> Let me try to elaborate:
>> 
>> In C you would first allocate a `struct tag_set` and partially
>> initialize it. The allocation can be dynamic, static or part of existing
>> allocation. You would then partially initialize the structure and finish
>> the initialization by calling `blk_mq_alloc_tag_set()`. This populates
>> the rest of the structure which includes more dynamic allocations.
>> 
>> You then allocate a `struct gendisk` by calling `blk_mq_alloc_disk()`,
>> passing in a pointer to the `struct tag_set` you just created. This
>> function will return a pointer to a `struct gendisk` on success.
>> 
>> In the Rust abstractions, we allocate the `TagSet`:
>> 
>> #[pin_data(PinnedDrop)]
>> #[repr(transparent)]
>> pub struct TagSet<T: Operations> {
>>      #[pin]
>>      inner: Opaque<bindings::blk_mq_tag_set>,
>>      _p: PhantomData<T>,
>> }
>> 
>> with `PinInit` [^1]. The initializer will partially initialize the struct and
>> finish the initialization like C does by calling
>> `blk_mq_alloc_tag_set()`. We now need a place to point the initializer.
>> `Arc::pin_init()` is that place for now. It allows us to pass the
>> `TagSet` reference to multiple `GenDisk` if required. Maybe we could be
>> generic over `Deref<TagSet>` in the future. Bottom line is that we need
>> to hold on to that `TagSet` reference until the `GenDisk` is dropped.
>
> I see, thanks for the elaborate explanation! I now think that using `Arc`
> makes sense.
>
>> `struct tag_set` is not reference counted on the C side. C
>> implementations just take care to keep it alive, for instance by storing
>> it next to a pointer to `struct gendisk` that it is servicing.
>
> This is interesting, is this also done in the case where it is shared
> among multiple `struct gendisk`s?

Yes. The architecture is really quite flexible. For instance C NVMe uses
one tag set for the admin queue of a nvme controller, and one tag set
shared for all IO queues of a controller. The admin queue tag set is not
actually attached to a `struct gendisk` and does not appear as a block
device to the user. The IO queue tag set serves a number `struct
gendisk`, one for each name space of the controller.

> Does this have some deeper reason? Or am I right to assume that creating
> `Gendisk`/`TagSet` is done rarely (i.e. only at initialization of the
> driver)?

I am not sure. It could probably be reference counted on C side. Perhaps
nobody felt the need for it, since the lifetime of it is not that
complex. And yes, it is relatively rare as it is only as part of
initialization and tear down that you would create or destroy this
structure.

Best regards,
Andreas


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC PATCH 06/11] rust: apply cache line padding for `SpinLock`
  2023-05-03 12:03   ` Alice Ryhl
@ 2024-02-23 11:29     ` Andreas Hindborg (Samsung)
  2024-02-26  9:15       ` Alice Ryhl
  0 siblings, 1 reply; 69+ messages in thread
From: Andreas Hindborg (Samsung) @ 2024-02-23 11:29 UTC (permalink / raw)
  To: Alice Ryhl
  Cc: Damien.LeMoal, alex.gaynor, axboe, benno.lossin, bjorn3_gh,
	boqun.feng, gary, gost.dev, hare, hch, kbusch, linux-block,
	linux-kernel, lsf-pc, ojeda, rust-for-linux, wedsonaf, willy


Hi Alice,

Alice Ryhl <aliceryhl@google.com> writes:

> On Wed, 3 May 2023 11:07:03 +0200, Andreas Hindborg <a.hindborg@samsung.com> wrote:
>> The kernel `struct spinlock` is 4 bytes on x86 when lockdep is not enabled. The
>> structure is not padded to fit a cache line. The effect of this for `SpinLock`
>> is that the lock variable and the value protected by the lock will share a cache
>> line, depending on the alignment requirements of the protected value. Aligning
>> the lock variable and the protected value to a cache line yields a 20%
>> performance increase for the Rust null block driver for sequential reads to
>> memory backed devices at 6 concurrent readers.
>> 
>> Signed-off-by: Andreas Hindborg <a.hindborg@samsung.com>
>
> This applies the cacheline padding to all spinlocks unconditionally.
> It's not clear to me that we want to do that. Instead, I suggest using
> `SpinLock<CachePadded<T>>` in the null block driver to opt-in to the
> cache padding there, and let other drivers choose whether or not they
> want to cache pad their locks.

I was going to write that this is not going to work because the compiler
is going to reorder the fields of `Lock` and put the `data` field first,
followed by the `state` field. But I checked the layout, and it seems
that I actually get the `state` field first (with an alignment of 4), 60
bytes of padding, and then the `data` field (with alignment 64).

I am wondering why the compiler is not reordering these fields? Am I
guaranteed that the fields will not be reordered? Looking at the
definition of `Lock` there does not seem to be anything that prevents
rustc from swapping `state` and `data`.

>
> On Wed, 3 May 2023 11:07:03 +0200, Andreas Hindborg <a.hindborg@samsung.com> wrote:
>> diff --git a/rust/kernel/cache_padded.rs b/rust/kernel/cache_padded.rs
>> new file mode 100644
>> index 000000000000..758678e71f50
>> --- /dev/null
>> +++ b/rust/kernel/cache_padded.rs
>> 
>> +impl<T> CachePadded<T> {
>> +    /// Pads and aligns a value to 64 bytes.
>> +    #[inline(always)]
>> +    pub(crate) const fn new(t: T) -> CachePadded<T> {
>> +        CachePadded::<T> { value: t }
>> +    }
>> +}
>
> Please make this `pub` instead of just `pub(crate)`. Other drivers might
> want to use this directly.

Alright.

>
> On Wed, 3 May 2023 11:07:03 +0200, Andreas Hindborg <a.hindborg@samsung.com> wrote:
>> diff --git a/rust/kernel/sync/lock/spinlock.rs b/rust/kernel/sync/lock/spinlock.rs
>> index 979b56464a4e..e39142a8148c 100644
>> --- a/rust/kernel/sync/lock/spinlock.rs
>> +++ b/rust/kernel/sync/lock/spinlock.rs
>> @@ -100,18 +103,20 @@ unsafe impl super::Backend for SpinLockBackend {
>>      ) {
>>          // SAFETY: The safety requirements ensure that `ptr` is valid for writes, and `name` and
>>          // `key` are valid for read indefinitely.
>> -        unsafe { bindings::__spin_lock_init(ptr, name, key) }
>> +        unsafe { bindings::__spin_lock_init((&mut *ptr).deref_mut(), name, key) }
>>      }
>>  
>> +    #[inline(always)]
>>      unsafe fn lock(ptr: *mut Self::State) -> Self::GuardState {
>>          // SAFETY: The safety requirements of this function ensure that `ptr` points to valid
>>          // memory, and that it has been initialised before.
>> -        unsafe { bindings::spin_lock(ptr) }
>> +        unsafe { bindings::spin_lock((&mut *ptr).deref_mut()) }
>>      }
>>  
>> +    #[inline(always)]
>>      unsafe fn unlock(ptr: *mut Self::State, _guard_state: &Self::GuardState) {
>>          // SAFETY: The safety requirements of this function ensure that `ptr` is valid and that the
>>          // caller is the owner of the mutex.
>> -        unsafe { bindings::spin_unlock(ptr) }
>> +        unsafe { bindings::spin_unlock((&mut *ptr).deref_mut()) }
>>      }
>>  }
>
> I would prefer to remain in pointer-land for the above operations. I
> think that this leads to core that is more obviously correct.
>
> For example:
>
> ```
> impl<T> CachePadded<T> {
>     pub const fn raw_get(ptr: *mut Self) -> *mut T {
>         core::ptr::addr_of_mut!((*ptr).value)
>     }
> }
>
> #[inline(always)]
> unsafe fn unlock(ptr: *mut Self::State, _guard_state: &Self::GuardState) {
>     unsafe { bindings::spin_unlock(CachePadded::raw_get(ptr)) }
> }
> ```

Got it 👍


BR Andreas

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC PATCH 06/11] rust: apply cache line padding for `SpinLock`
  2024-02-23 11:29     ` Andreas Hindborg (Samsung)
@ 2024-02-26  9:15       ` Alice Ryhl
  0 siblings, 0 replies; 69+ messages in thread
From: Alice Ryhl @ 2024-02-26  9:15 UTC (permalink / raw)
  To: Andreas Hindborg (Samsung)
  Cc: Damien.LeMoal, alex.gaynor, axboe, benno.lossin, bjorn3_gh,
	boqun.feng, gary, gost.dev, hare, hch, kbusch, linux-block,
	linux-kernel, lsf-pc, ojeda, rust-for-linux, wedsonaf, willy

On Mon, Feb 26, 2024 at 10:02 AM Andreas Hindborg (Samsung)
<nmi@metaspace.dk> wrote:
>
>
> Hi Alice,
>
> Alice Ryhl <aliceryhl@google.com> writes:
>
> > On Wed, 3 May 2023 11:07:03 +0200, Andreas Hindborg <a.hindborg@samsung.com> wrote:
> >> The kernel `struct spinlock` is 4 bytes on x86 when lockdep is not enabled. The
> >> structure is not padded to fit a cache line. The effect of this for `SpinLock`
> >> is that the lock variable and the value protected by the lock will share a cache
> >> line, depending on the alignment requirements of the protected value. Aligning
> >> the lock variable and the protected value to a cache line yields a 20%
> >> performance increase for the Rust null block driver for sequential reads to
> >> memory backed devices at 6 concurrent readers.
> >>
> >> Signed-off-by: Andreas Hindborg <a.hindborg@samsung.com>
> >
> > This applies the cacheline padding to all spinlocks unconditionally.
> > It's not clear to me that we want to do that. Instead, I suggest using
> > `SpinLock<CachePadded<T>>` in the null block driver to opt-in to the
> > cache padding there, and let other drivers choose whether or not they
> > want to cache pad their locks.
>
> I was going to write that this is not going to work because the compiler
> is going to reorder the fields of `Lock` and put the `data` field first,
> followed by the `state` field. But I checked the layout, and it seems
> that I actually get the `state` field first (with an alignment of 4), 60
> bytes of padding, and then the `data` field (with alignment 64).
>
> I am wondering why the compiler is not reordering these fields? Am I
> guaranteed that the fields will not be reordered? Looking at the
> definition of `Lock` there does not seem to be anything that prevents
> rustc from swapping `state` and `data`.

It's because `Lock` has `: ?Sized` on the `T` generic. Fields that
might not be Sized must always be last.

Alice

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC PATCH 04/11] rust: block: introduce `kernel::block::bio` module
  2024-01-11 12:49     ` Andreas Hindborg (Samsung)
@ 2024-02-28 14:31       ` Andreas Hindborg
  2024-03-09 12:30         ` Benno Lossin
  0 siblings, 1 reply; 69+ messages in thread
From: Andreas Hindborg @ 2024-02-28 14:31 UTC (permalink / raw)
  To: Benno Lossin
  Cc: Andreas Hindborg, Jens Axboe, Christoph Hellwig, Keith Busch,
	Damien Le Moal, Hannes Reinecke, lsf-pc, rust-for-linux,
	linux-block, Matthew Wilcox, Miguel Ojeda, Alex Gaynor,
	Wedson Almeida Filho, Boqun Feng, Gary Guo, Björn Roy Baron,
	linux-kernel, gost.dev


Hi Benno,

"Andreas Hindborg (Samsung)" <nmi@metaspace.dk> writes:

<cut>

>>> +);
>>> +
>>> +impl<'a> Bio<'a> {
>>> +    /// Returns an iterator over segments in this `Bio`. Does not consider
>>> +    /// segments of other bios in this bio chain.
>>> +    #[inline(always)]
>>
>> Why are these `inline(always)`? The compiler should inline them
>> automatically?
>
> No, the compiler would not inline into modules without them. I'll check
> again if that is still the case.

I just tested this again. If I remove the attribute, the compiler will
inline some of the functions but not others. I guess it depends on the
inlining heuristics of rustc. The majority of the attributes I have put
is not necessary, since the compiler will inline by default. But for
instance `<BioIterator as Iterator>::next` is not inlined by default and
it really should be inlined.

Since most of the attributes do not change compiler default behavior, I
would rather tag all functions that I want inlined than have to
disassemble build outputs to check which functions actually need the
attribute. With this approach, we are not affected by changes to
compiler heuristics either.

What do you think?

Best regards,
Andreas


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [RFC PATCH 04/11] rust: block: introduce `kernel::block::bio` module
  2024-02-28 14:31       ` Andreas Hindborg
@ 2024-03-09 12:30         ` Benno Lossin
  0 siblings, 0 replies; 69+ messages in thread
From: Benno Lossin @ 2024-03-09 12:30 UTC (permalink / raw)
  To: Andreas Hindborg
  Cc: Jens Axboe, Christoph Hellwig, Keith Busch, Damien Le Moal,
	Hannes Reinecke, lsf-pc, rust-for-linux, linux-block,
	Matthew Wilcox, Miguel Ojeda, Alex Gaynor, Wedson Almeida Filho,
	Boqun Feng, Gary Guo, Björn Roy Baron, linux-kernel,
	gost.dev

On 2/28/24 15:31, Andreas Hindborg wrote:
> 
> Hi Benno,
> 
> "Andreas Hindborg (Samsung)" <nmi@metaspace.dk> writes:
> 
> <cut>
> 
>>>> +);
>>>> +
>>>> +impl<'a> Bio<'a> {
>>>> +    /// Returns an iterator over segments in this `Bio`. Does not consider
>>>> +    /// segments of other bios in this bio chain.
>>>> +    #[inline(always)]
>>>
>>> Why are these `inline(always)`? The compiler should inline them
>>> automatically?
>>
>> No, the compiler would not inline into modules without them. I'll check
>> again if that is still the case.
> 
> I just tested this again. If I remove the attribute, the compiler will
> inline some of the functions but not others. I guess it depends on the
> inlining heuristics of rustc. The majority of the attributes I have put
> is not necessary, since the compiler will inline by default. But for
> instance `<BioIterator as Iterator>::next` is not inlined by default and
> it really should be inlined.
> 
> Since most of the attributes do not change compiler default behavior, I
> would rather tag all functions that I want inlined than have to
> disassemble build outputs to check which functions actually need the
> attribute. With this approach, we are not affected by changes to
> compiler heuristics either.
> 
> What do you think?

I think that you should do whatever leads to the best results in
practice. I know that the compiler developers spend considerable time
coming up with smart algorithms for deciding when and when not to inline
functions. But they aren't perfect, so if you find them necessary then
please add them.
What I want to avoid is that we end up tagging every function
`inline(always)`, or at least we do not check if it makes a difference.

-- 
Cheers,
Benno


^ permalink raw reply	[flat|nested] 69+ messages in thread

end of thread, other threads:[~2024-03-09 12:31 UTC | newest]

Thread overview: 69+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-05-03  9:06 [RFC PATCH 00/11] Rust null block driver Andreas Hindborg
2023-05-03  9:06 ` [RFC PATCH 01/11] rust: add radix tree abstraction Andreas Hindborg
2023-05-03 10:34   ` Benno Lossin
2023-05-05  4:04   ` Matthew Wilcox
2023-05-05  4:49     ` Andreas Hindborg
2023-05-05  5:28       ` Matthew Wilcox
2023-05-05  6:09         ` Christoph Hellwig
2023-05-05  8:33           ` Chaitanya Kulkarni
2023-05-03  9:06 ` [RFC PATCH 02/11] rust: add `pages` module for handling page allocation Andreas Hindborg
2023-05-03 12:31   ` Benno Lossin
2023-05-03 12:38     ` Benno Lossin
2023-05-05  4:09   ` Matthew Wilcox
2023-05-05  4:42     ` Andreas Hindborg
2023-05-05  5:29       ` Matthew Wilcox
2023-05-03  9:07 ` [RFC PATCH 03/11] rust: block: introduce `kernel::block::mq` module Andreas Hindborg
2023-05-08 12:29   ` Benno Lossin
2023-05-11  6:52     ` Sergio González Collado
2024-01-23 14:03       ` Andreas Hindborg (Samsung)
2024-01-12  9:18     ` Andreas Hindborg (Samsung)
2024-01-23 16:14       ` Benno Lossin
2024-01-23 18:39         ` Andreas Hindborg (Samsung)
2024-01-25  9:26           ` Benno Lossin
2024-01-29 14:14             ` Andreas Hindborg (Samsung)
2023-05-03  9:07 ` [RFC PATCH 04/11] rust: block: introduce `kernel::block::bio` module Andreas Hindborg
2023-05-08 12:58   ` Benno Lossin
2024-01-11 12:49     ` Andreas Hindborg (Samsung)
2024-02-28 14:31       ` Andreas Hindborg
2024-03-09 12:30         ` Benno Lossin
2023-05-03  9:07 ` [RFC PATCH 05/11] RUST: add `module_params` macro Andreas Hindborg
2023-05-03  9:07 ` [RFC PATCH 06/11] rust: apply cache line padding for `SpinLock` Andreas Hindborg
2023-05-03 12:03   ` Alice Ryhl
2024-02-23 11:29     ` Andreas Hindborg (Samsung)
2024-02-26  9:15       ` Alice Ryhl
2023-05-03  9:07 ` [RFC PATCH 07/11] rust: lock: add support for `Lock::lock_irqsave` Andreas Hindborg
2023-05-03  9:07 ` [RFC PATCH 08/11] rust: lock: implement `IrqSaveBackend` for `SpinLock` Andreas Hindborg
2023-05-03  9:07 ` [RFC PATCH 09/11] RUST: implement `ForeignOwnable` for `Pin` Andreas Hindborg
2023-05-03  9:07 ` [RFC PATCH 10/11] rust: add null block driver Andreas Hindborg
2023-05-03  9:07 ` [RFC PATCH 11/11] rust: inline a number of short functions Andreas Hindborg
2023-05-03 11:32 ` [RFC PATCH 00/11] Rust null block driver Niklas Cassel
2023-05-03 12:29   ` Andreas Hindborg
2023-05-03 13:54     ` Niklas Cassel
2023-05-03 16:47 ` Bart Van Assche
2023-05-04 18:15   ` Andreas Hindborg
2023-05-04 18:36     ` Bart Van Assche
2023-05-04 18:46       ` Andreas Hindborg
2023-05-04 18:52       ` Keith Busch
2023-05-04 19:02         ` Jens Axboe
2023-05-04 19:59           ` Andreas Hindborg
2023-05-04 20:55             ` Jens Axboe
2023-05-05  5:06               ` Andreas Hindborg
2023-05-05 11:14               ` Miguel Ojeda
2023-05-04 20:11           ` Miguel Ojeda
2023-05-04 20:22             ` Jens Axboe
2023-05-05 10:53               ` Miguel Ojeda
2023-05-05 12:24                 ` Boqun Feng
2023-05-05 13:52                   ` Boqun Feng
2023-05-05 19:42                   ` Keith Busch
2023-05-05 21:46                     ` Boqun Feng
2023-05-05 19:38                 ` Bart Van Assche
2023-05-05  3:52           ` Christoph Hellwig
2023-06-06 13:33           ` Andreas Hindborg (Samsung)
2023-06-06 14:46             ` Miguel Ojeda
2023-05-05  5:28       ` Hannes Reinecke
2023-05-07 23:31 ` Luis Chamberlain
2023-05-07 23:37   ` Andreas Hindborg
2023-07-27  3:45 ` Yexuan Yang
2023-07-27  3:47 ` Yexuan Yang
     [not found] ` <2B3CA5F1CCCFEAB2+20230727034517.GB126117@1182282462>
2023-07-28  6:49   ` Andreas Hindborg (Samsung)
2023-07-31 14:14     ` Andreas Hindborg (Samsung)

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).