Skip to content

Commit 7e0c9e8

Browse files
author
Ming Lei
committed
block: restore two stage elevator switch while running nr_hw_queue update
JIRA: https://issues.redhat.com/browse/RHEL-112997 commit 5989bfe Author: Nilay Shroff <nilay@linux.ibm.com> Date: Thu Jul 24 15:31:51 2025 +0530 block: restore two stage elevator switch while running nr_hw_queue update The kmemleak reports memory leaks related to elevator resources that were originally allocated in the ->init_hctx() method. The following leak traces are observed after running blktests block/040: unreferenced object 0xffff8881b82f7400 (size 512): comm "check", pid 68454, jiffies 4310588881 hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace (crc 5bac8b34): __kvmalloc_node_noprof+0x55d/0x7a0 sbitmap_init_node+0x15a/0x6a0 kyber_init_hctx+0x316/0xb90 blk_mq_init_sched+0x419/0x580 elevator_switch+0x18b/0x630 elv_update_nr_hw_queues+0x219/0x2c0 __blk_mq_update_nr_hw_queues+0x36a/0x6f0 blk_mq_update_nr_hw_queues+0x3a/0x60 0xffffffffc09ceb80 0xffffffffc09d7e0b configfs_write_iter+0x2b1/0x470 vfs_write+0x527/0xe70 ksys_write+0xff/0x200 do_syscall_64+0x98/0x3c0 entry_SYSCALL_64_after_hwframe+0x76/0x7e unreferenced object 0xffff8881b82f6000 (size 512): comm "check", pid 68454, jiffies 4310588881 hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace (crc 5bac8b34): __kvmalloc_node_noprof+0x55d/0x7a0 sbitmap_init_node+0x15a/0x6a0 kyber_init_hctx+0x316/0xb90 blk_mq_init_sched+0x419/0x580 elevator_switch+0x18b/0x630 elv_update_nr_hw_queues+0x219/0x2c0 __blk_mq_update_nr_hw_queues+0x36a/0x6f0 blk_mq_update_nr_hw_queues+0x3a/0x60 0xffffffffc09ceb80 0xffffffffc09d7e0b configfs_write_iter+0x2b1/0x470 vfs_write+0x527/0xe70 ksys_write+0xff/0x200 do_syscall_64+0x98/0x3c0 entry_SYSCALL_64_after_hwframe+0x76/0x7e unreferenced object 0xffff8881b82f5800 (size 512): comm "check", pid 68454, jiffies 4310588881 hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace (crc 5bac8b34): __kvmalloc_node_noprof+0x55d/0x7a0 sbitmap_init_node+0x15a/0x6a0 kyber_init_hctx+0x316/0xb90 blk_mq_init_sched+0x419/0x580 elevator_switch+0x18b/0x630 elv_update_nr_hw_queues+0x219/0x2c0 __blk_mq_update_nr_hw_queues+0x36a/0x6f0 blk_mq_update_nr_hw_queues+0x3a/0x60 0xffffffffc09ceb80 0xffffffffc09d7e0b configfs_write_iter+0x2b1/0x470 vfs_write+0x527/0xe70 ksys_write+0xff/0x200 do_syscall_64+0x98/0x3c0 entry_SYSCALL_64_after_hwframe+0x76/0x7e The issue arises while we run nr_hw_queue update, Specifically, we first reallocate hardware contexts (hctx) via __blk_mq_realloc_hw_ctxs(), and then later invoke elevator_switch() (assuming q->elevator is not NULL). The elevator switch code would first exit old elevator (elevator_exit) and then switches to the new elevator. The elevator_exit loops through each hctx and invokes the elevator’s per-hctx exit method ->exit_hctx(), which releases resources allocated during ->init_hctx(). This memleak manifests when we reduce the num of h/w queues - for example, when the initial update sets the number of queues to X, and a later update reduces it to Y, where Y < X. In this case, we'd loose the access to old hctxs while we get to elevator exit code because __blk_mq_realloc_hw_ctxs would have already released the old hctxs. As we don't now have any reference left to the old hctxs, we don't have any way to free the scheduler resources (which are allocate in ->init_hctx()) and kmemleak complains about it. This issue was caused due to the commit 596dce1 ("block: simplify elevator reattachment for updating nr_hw_queues"). That change unified the two-stage elevator teardown and reattachment into a single call that occurs after __blk_mq_realloc_hw_ctxs() has already freed the hctxs. This patch restores the previous two-stage elevator switch logic during nr_hw_queues updates. First, the elevator is switched to 'none', which ensures all scheduler resources are properly freed. Then, the hardware contexts (hctxs) are reallocated, and the software-to-hardware queue mappings are updated. Finally, the original elevator is reattached. This sequence prevents loss of references to old hctxs and avoids the scheduler resource leaks reported by kmemleak. Reported-by : Yi Zhang <yi.zhang@redhat.com> Fixes: 596dce1 ("block: simplify elevator reattachment for updating nr_hw_queues") Closes: https://lore.kernel.org/all/CAHj4cs8oJFvz=daCvjHM5dYCNQH4UXwSySPPU4v-WHce_kZXZA@mail.gmail.com/ Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20250724102540.1366308-1-nilay@linux.ibm.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Ming Lei <ming.lei@redhat.com>
1 parent ab0a1ce commit 7e0c9e8

File tree

3 files changed

+81
-15
lines changed

3 files changed

+81
-15
lines changed

block/blk-mq.c

Lines changed: 75 additions & 9 deletions
Original file line numberDiff line numberDiff line change
@@ -4968,13 +4968,68 @@ int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
49684968
return ret;
49694969
}
49704970

4971+
/*
4972+
* Switch back to the elevator type stored in the xarray.
4973+
*/
4974+
static void blk_mq_elv_switch_back(struct request_queue *q,
4975+
struct xarray *elv_tbl)
4976+
{
4977+
struct elevator_type *e = xa_load(elv_tbl, q->id);
4978+
4979+
/* The elv_update_nr_hw_queues unfreezes the queue. */
4980+
elv_update_nr_hw_queues(q, e);
4981+
4982+
/* Drop the reference acquired in blk_mq_elv_switch_none. */
4983+
if (e)
4984+
elevator_put(e);
4985+
}
4986+
4987+
/*
4988+
* Stores elevator type in xarray and set current elevator to none. It uses
4989+
* q->id as an index to store the elevator type into the xarray.
4990+
*/
4991+
static int blk_mq_elv_switch_none(struct request_queue *q,
4992+
struct xarray *elv_tbl)
4993+
{
4994+
int ret = 0;
4995+
4996+
lockdep_assert_held_write(&q->tag_set->update_nr_hwq_lock);
4997+
4998+
/*
4999+
* Accessing q->elevator without holding q->elevator_lock is safe here
5000+
* because we're called from nr_hw_queue update which is protected by
5001+
* set->update_nr_hwq_lock in the writer context. So, scheduler update/
5002+
* switch code (which acquires the same lock in the reader context)
5003+
* can't run concurrently.
5004+
*/
5005+
if (q->elevator) {
5006+
5007+
ret = xa_insert(elv_tbl, q->id, q->elevator->type, GFP_KERNEL);
5008+
if (WARN_ON_ONCE(ret))
5009+
return ret;
5010+
5011+
/*
5012+
* Before we switch elevator to 'none', take a reference to
5013+
* the elevator module so that while nr_hw_queue update is
5014+
* running, no one can remove elevator module. We'd put the
5015+
* reference to elevator module later when we switch back
5016+
* elevator.
5017+
*/
5018+
__elevator_get(q->elevator->type);
5019+
5020+
elevator_set_none(q);
5021+
}
5022+
return ret;
5023+
}
5024+
49715025
static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
49725026
int nr_hw_queues)
49735027
{
49745028
struct request_queue *q;
49755029
int prev_nr_hw_queues = set->nr_hw_queues;
49765030
unsigned int memflags;
49775031
int i;
5032+
struct xarray elv_tbl;
49785033

49795034
lockdep_assert_held(&set->tag_list_lock);
49805035

@@ -4986,6 +5041,9 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
49865041
return;
49875042

49885043
memflags = memalloc_noio_save();
5044+
5045+
xa_init(&elv_tbl);
5046+
49895047
list_for_each_entry(q, &set->tag_list, tag_set_list) {
49905048
blk_mq_debugfs_unregister_hctxs(q);
49915049
blk_mq_sysfs_unregister_hctxs(q);
@@ -4994,11 +5052,17 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
49945052
list_for_each_entry(q, &set->tag_list, tag_set_list)
49955053
blk_mq_freeze_queue_nomemsave(q);
49965054

4997-
if (blk_mq_realloc_tag_set_tags(set, nr_hw_queues) < 0) {
4998-
list_for_each_entry(q, &set->tag_list, tag_set_list)
4999-
blk_mq_unfreeze_queue_nomemrestore(q);
5000-
goto reregister;
5001-
}
5055+
/*
5056+
* Switch IO scheduler to 'none', cleaning up the data associated
5057+
* with the previous scheduler. We will switch back once we are done
5058+
* updating the new sw to hw queue mappings.
5059+
*/
5060+
list_for_each_entry(q, &set->tag_list, tag_set_list)
5061+
if (blk_mq_elv_switch_none(q, &elv_tbl))
5062+
goto switch_back;
5063+
5064+
if (blk_mq_realloc_tag_set_tags(set, nr_hw_queues) < 0)
5065+
goto switch_back;
50025066

50035067
fallback:
50045068
blk_mq_update_queue_map(set);
@@ -5018,19 +5082,21 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
50185082
}
50195083
blk_mq_map_swqueue(q);
50205084
}
5021-
5022-
/* elv_update_nr_hw_queues() unfreeze queue for us */
5085+
switch_back:
5086+
/* The blk_mq_elv_switch_back unfreezes queue for us. */
50235087
list_for_each_entry(q, &set->tag_list, tag_set_list)
5024-
elv_update_nr_hw_queues(q);
5088+
blk_mq_elv_switch_back(q, &elv_tbl);
50255089

5026-
reregister:
50275090
list_for_each_entry(q, &set->tag_list, tag_set_list) {
50285091
blk_mq_sysfs_register_hctxs(q);
50295092
blk_mq_debugfs_register_hctxs(q);
50305093

50315094
blk_mq_remove_hw_queues_cpuhp(q);
50325095
blk_mq_add_hw_queues_cpuhp(q);
50335096
}
5097+
5098+
xa_destroy(&elv_tbl);
5099+
50345100
memalloc_noio_restore(memflags);
50355101

50365102
/* Free the excess tags when nr_hw_queues shrink. */

block/blk.h

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -321,7 +321,7 @@ bool blk_bio_list_merge(struct request_queue *q, struct list_head *list,
321321

322322
bool blk_insert_flush(struct request *rq);
323323

324-
void elv_update_nr_hw_queues(struct request_queue *q);
324+
void elv_update_nr_hw_queues(struct request_queue *q, struct elevator_type *e);
325325
void elevator_set_default(struct request_queue *q);
326326
void elevator_set_none(struct request_queue *q);
327327

block/elevator.c

Lines changed: 5 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -689,21 +689,21 @@ static int elevator_change(struct request_queue *q, struct elv_change_ctx *ctx)
689689
* The I/O scheduler depends on the number of hardware queues, this forces a
690690
* reattachment when nr_hw_queues changes.
691691
*/
692-
void elv_update_nr_hw_queues(struct request_queue *q)
692+
void elv_update_nr_hw_queues(struct request_queue *q, struct elevator_type *e)
693693
{
694694
struct elv_change_ctx ctx = {};
695695
int ret = -ENODEV;
696696

697697
WARN_ON_ONCE(q->mq_freeze_depth == 0);
698698

699-
mutex_lock(&q->elevator_lock);
700-
if (q->elevator && !blk_queue_dying(q) && blk_queue_registered(q)) {
701-
ctx.name = q->elevator->type->elevator_name;
699+
if (e && !blk_queue_dying(q) && blk_queue_registered(q)) {
700+
ctx.name = e->elevator_name;
702701

702+
mutex_lock(&q->elevator_lock);
703703
/* force to reattach elevator after nr_hw_queue is updated */
704704
ret = elevator_switch(q, &ctx);
705+
mutex_unlock(&q->elevator_lock);
705706
}
706-
mutex_unlock(&q->elevator_lock);
707707
blk_mq_unfreeze_queue_nomemrestore(q);
708708
if (!ret)
709709
WARN_ON_ONCE(elevator_change_done(q, &ctx));

0 commit comments

Comments
 (0)