* [PATCH] btrfs: continue readahead of siblings even if target node is in memory
@ 2021-07-20 15:03 fdmanana
  2021-07-20 23:57 ` Qu Wenruo
  2021-07-21 15:16 ` David Sterba
  0 siblings, 2 replies; 3+ messages in thread
From: fdmanana @ 2021-07-20 15:03 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

At reada_for_search(), when attempting to readahead a node or leaf's
siblings, we skip the readahead of the siblings if the node/leaf is
already in memory. That is probably fine for the READA_FORWARD and
READA_BACK readahead types, as they are used in contexts where we
end up reading some consecutive leaves, but usually not the whole btree.

However for a READA_FORWARD_ALWAYS mode, currently only used for full
send operations, it does not make sense to skip the readahead if the
target node or leaf is already loaded in memory, since we know the caller
is visiting every node and leaf of the btree in ascending order.

So change the behaviour to not skip the readahead when the target node is
already in memory and the readahead mode is READA_FORWARD_ALWAYS.

The following test script was used to measure the improvement on a box
using an average, consumer-grade spinning disk, with 32GiB of RAM and
using a non-debug kernel config (Debian's default config).

  $ cat test.sh
  #!/bin/bash

  DEV=/dev/sdj
  MNT=/mnt/sdj
  MKFS_OPTIONS="--nodesize 16384"     # default, just to be explicit
  MOUNT_OPTIONS="-o max_inline=2048"  # default, just to be explicit

  mkfs.btrfs -f $MKFS_OPTIONS $DEV > /dev/null
  mount $MOUNT_OPTIONS $DEV $MNT

  # Create files with inline data to make it easier and faster to create
  # large btrees.
  add_files()
  {
      local total=$1
      local start_offset=$2
      local number_jobs=$3
      local total_per_job=$(($total / $number_jobs))

      echo "Creating $total new files using $number_jobs jobs"
      for ((n = 0; n < $number_jobs; n++)); do
          (
              local start_num=$(($start_offset + $n * $total_per_job))
              for ((i = 1; i <= $total_per_job; i++)); do
                  local file_num=$((start_num + $i))
                  local file_path="$MNT/file_${file_num}"
                  xfs_io -f -c "pwrite -S 0xab 0 2000" $file_path > /dev/null
                  if [ $? -ne 0 ]; then
                      echo "Failed creating file $file_path"
                      break
                  fi
              done
          ) &
          worker_pids[$n]=$!
      done

      wait ${worker_pids[@]}

      sync
      echo
      echo "btree node/leaf count: $(btrfs inspect-internal dump-tree -t 5 $DEV | egrep '^(node|leaf) ' | wc -l)"
  }

  file_count=2000000
  add_files $file_count 0 4

  echo
  echo "Creating snapshot..."
  btrfs subvolume snapshot -r $MNT $MNT/snap1

  umount $MNT

  echo 3 > /proc/sys/vm/drop_caches
  blockdev --flushbufs $DEV &> /dev/null
  hdparm -F $DEV &> /dev/null

  mount $MOUNT_OPTIONS $DEV $MNT

  echo
  echo "Testing full send..."
  start=$(date +%s)
  btrfs send $MNT/snap1 > /dev/null
  end=$(date +%s)
  echo
  echo "Full send took $((end - start)) seconds"

  umount $MNT

The duration of the full send operation, in seconds, was the following:

Before this change:  85 seconds
After this change:   76 seconds (-11.2%)

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/ctree.c | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index c212f1218fdd..63c026495193 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -1233,7 +1233,6 @@ static void reada_for_search(struct btrfs_fs_info *fs_info,
 	u64 target;
 	u64 nread = 0;
 	u64 nread_max;
-	struct extent_buffer *eb;
 	u32 nr;
 	u32 blocksize;
 	u32 nscan = 0;
@@ -1262,10 +1261,14 @@ static void reada_for_search(struct btrfs_fs_info *fs_info,
 
 	search = btrfs_node_blockptr(node, slot);
 	blocksize = fs_info->nodesize;
-	eb = find_extent_buffer(fs_info, search);
-	if (eb) {
-		free_extent_buffer(eb);
-		return;
+	if (path->reada != READA_FORWARD_ALWAYS) {
+		struct extent_buffer *eb;
+
+		eb = find_extent_buffer(fs_info, search);
+		if (eb) {
+			free_extent_buffer(eb);
+			return;
+		}
 	}
 
 	target = search;
-- 
2.30.2



* Re: [PATCH] btrfs: continue readahead of siblings even if target node is in memory
  2021-07-20 15:03 [PATCH] btrfs: continue readahead of siblings even if target node is in memory fdmanana
@ 2021-07-20 23:57 ` Qu Wenruo
  2021-07-21 15:16 ` David Sterba
  1 sibling, 0 replies; 3+ messages in thread
From: Qu Wenruo @ 2021-07-20 23:57 UTC (permalink / raw)
  To: fdmanana, linux-btrfs



On 2021/7/20 11:03 PM, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
>
> At reada_for_search(), when attempting to readahead a node or leaf's
> siblings, we skip the readahead of the siblings if the node/leaf is
> already in memory. That is probably fine for the READA_FORWARD and
> READA_BACK readahead types, as they are used in contexts where we
> end up reading some consecutive leaves, but usually not the whole btree.
>
> However for a READA_FORWARD_ALWAYS mode, currently only used for full
> send operations, it does not make sense to skip the readahead if the
> target node or leaf is already loaded in memory, since we know the caller
> is visiting every node and leaf of the btree in ascending order.
>
> So change the behaviour to not skip the readahead when the target node is
> already in memory and the readahead mode is READA_FORWARD_ALWAYS.
>
> The following test script was used to measure the improvement on a box
> using an average, consumer-grade spinning disk, with 32GiB of RAM and
> using a non-debug kernel config (Debian's default config).
>
>    $ cat test.sh
>    #!/bin/bash
>
>    DEV=/dev/sdj
>    MNT=/mnt/sdj
>    MKFS_OPTIONS="--nodesize 16384"     # default, just to be explicit
>    MOUNT_OPTIONS="-o max_inline=2048"  # default, just to be explicit
>
>    mkfs.btrfs -f $MKFS_OPTIONS $DEV > /dev/null
>    mount $MOUNT_OPTIONS $DEV $MNT
>
>    # Create files with inline data to make it easier and faster to create
>    # large btrees.
>    add_files()
>    {
>        local total=$1
>        local start_offset=$2
>        local number_jobs=$3
>        local total_per_job=$(($total / $number_jobs))
>
>        echo "Creating $total new files using $number_jobs jobs"
>        for ((n = 0; n < $number_jobs; n++)); do
>            (
>                local start_num=$(($start_offset + $n * $total_per_job))
>                for ((i = 1; i <= $total_per_job; i++)); do
>                    local file_num=$((start_num + $i))
>                    local file_path="$MNT/file_${file_num}"
>                    xfs_io -f -c "pwrite -S 0xab 0 2000" $file_path > /dev/null
>                    if [ $? -ne 0 ]; then
>                        echo "Failed creating file $file_path"
>                        break
>                    fi
>                done
>            ) &
>            worker_pids[$n]=$!
>        done
>
>        wait ${worker_pids[@]}
>
>        sync
>        echo
>        echo "btree node/leaf count: $(btrfs inspect-internal dump-tree -t 5 $DEV | egrep '^(node|leaf) ' | wc -l)"
>    }
>
>    file_count=2000000
>    add_files $file_count 0 4
>
>    echo
>    echo "Creating snapshot..."
>    btrfs subvolume snapshot -r $MNT $MNT/snap1
>
>    umount $MNT
>
>    echo 3 > /proc/sys/vm/drop_caches
>    blockdev --flushbufs $DEV &> /dev/null
>    hdparm -F $DEV &> /dev/null
>
>    mount $MOUNT_OPTIONS $DEV $MNT
>
>    echo
>    echo "Testing full send..."
>    start=$(date +%s)
>    btrfs send $MNT/snap1 > /dev/null
>    end=$(date +%s)
>    echo
>    echo "Full send took $((end - start)) seconds"
>
>    umount $MNT
>
> The duration of the full send operation, in seconds, was the following:
>
> Before this change:  85 seconds
> After this change:   76 seconds (-11.2%)
>
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

Reviewed-by: Qu Wenruo <wqu@suse.com>

Just one off-topic idea inlined below:

> ---
>   fs/btrfs/ctree.c | 13 ++++++++-----
>   1 file changed, 8 insertions(+), 5 deletions(-)
>
> diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
> index c212f1218fdd..63c026495193 100644
> --- a/fs/btrfs/ctree.c
> +++ b/fs/btrfs/ctree.c
> @@ -1233,7 +1233,6 @@ static void reada_for_search(struct btrfs_fs_info *fs_info,
>   	u64 target;
>   	u64 nread = 0;
>   	u64 nread_max;
> -	struct extent_buffer *eb;
>   	u32 nr;
>   	u32 blocksize;
>   	u32 nscan = 0;
> @@ -1262,10 +1261,14 @@ static void reada_for_search(struct btrfs_fs_info *fs_info,
>
>   	search = btrfs_node_blockptr(node, slot);
>   	blocksize = fs_info->nodesize;
> -	eb = find_extent_buffer(fs_info, search);
> -	if (eb) {
> -		free_extent_buffer(eb);
> -		return;
> +	if (path->reada != READA_FORWARD_ALWAYS) {

Currently only send uses this flag, but could it be applied to more call
sites like full device scrub?

Thanks,
Qu
> +		struct extent_buffer *eb;
> +
> +		eb = find_extent_buffer(fs_info, search);
> +		if (eb) {
> +			free_extent_buffer(eb);
> +			return;
> +		}
>   	}
>
>   	target = search;
>


* Re: [PATCH] btrfs: continue readahead of siblings even if target node is in memory
  2021-07-20 15:03 [PATCH] btrfs: continue readahead of siblings even if target node is in memory fdmanana
  2021-07-20 23:57 ` Qu Wenruo
@ 2021-07-21 15:16 ` David Sterba
  1 sibling, 0 replies; 3+ messages in thread
From: David Sterba @ 2021-07-21 15:16 UTC (permalink / raw)
  To: fdmanana; +Cc: linux-btrfs

On Tue, Jul 20, 2021 at 04:03:03PM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> At reada_for_search(), when attempting to readahead a node or leaf's
> siblings, we skip the readahead of the siblings if the node/leaf is
> already in memory. That is probably fine for the READA_FORWARD and
> READA_BACK readahead types, as they are used in contexts where we
> end up reading some consecutive leaves, but usually not the whole btree.
> 
> However for a READA_FORWARD_ALWAYS mode, currently only used for full
> send operations, it does not make sense to skip the readahead if the
> target node or leaf is already loaded in memory, since we know the caller
> is visiting every node and leaf of the btree in ascending order.
> 
> So change the behaviour to not skip the readahead when the target node is
> already in memory and the readahead mode is READA_FORWARD_ALWAYS.
> 
> The following test script was used to measure the improvement on a box
> using an average, consumer-grade spinning disk, with 32GiB of RAM and
> using a non-debug kernel config (Debian's default config).
> 
>   $ cat test.sh
>   #!/bin/bash
> 
>   DEV=/dev/sdj
>   MNT=/mnt/sdj
>   MKFS_OPTIONS="--nodesize 16384"     # default, just to be explicit
>   MOUNT_OPTIONS="-o max_inline=2048"  # default, just to be explicit
> 
>   mkfs.btrfs -f $MKFS_OPTIONS $DEV > /dev/null
>   mount $MOUNT_OPTIONS $DEV $MNT
> 
>   # Create files with inline data to make it easier and faster to create
>   # large btrees.
>   add_files()
>   {
>       local total=$1
>       local start_offset=$2
>       local number_jobs=$3
>       local total_per_job=$(($total / $number_jobs))
> 
>       echo "Creating $total new files using $number_jobs jobs"
>       for ((n = 0; n < $number_jobs; n++)); do
>           (
>               local start_num=$(($start_offset + $n * $total_per_job))
>               for ((i = 1; i <= $total_per_job; i++)); do
>                   local file_num=$((start_num + $i))
>                   local file_path="$MNT/file_${file_num}"
>                   xfs_io -f -c "pwrite -S 0xab 0 2000" $file_path > /dev/null
>                   if [ $? -ne 0 ]; then
>                       echo "Failed creating file $file_path"
>                       break
>                   fi
>               done
>           ) &
>           worker_pids[$n]=$!
>       done
> 
>       wait ${worker_pids[@]}
> 
>       sync
>       echo
>       echo "btree node/leaf count: $(btrfs inspect-internal dump-tree -t 5 $DEV | egrep '^(node|leaf) ' | wc -l)"
>   }
> 
>   file_count=2000000
>   add_files $file_count 0 4
> 
>   echo
>   echo "Creating snapshot..."
>   btrfs subvolume snapshot -r $MNT $MNT/snap1
> 
>   umount $MNT
> 
>   echo 3 > /proc/sys/vm/drop_caches
>   blockdev --flushbufs $DEV &> /dev/null
>   hdparm -F $DEV &> /dev/null
> 
>   mount $MOUNT_OPTIONS $DEV $MNT
> 
>   echo
>   echo "Testing full send..."
>   start=$(date +%s)
>   btrfs send $MNT/snap1 > /dev/null
>   end=$(date +%s)
>   echo
>   echo "Full send took $((end - start)) seconds"
> 
>   umount $MNT
> 
> The duration of the full send operation, in seconds, was the following:
> 
> Before this change:  85 seconds
> After this change:   76 seconds (-11.2%)
> 
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

Nice, added to misc-next, thanks.

