Age | Commit message (Collapse) | Author |
|
|
|
commit e8abe1de43dac658dacbd04a4543e0c988a8d386 upstream.
The error handling calls md_bitmap_free(bitmap) which checks for NULL
but will Oops if we pass an error pointer. Let's set "bitmap" to NULL
on this error path.
Fixes: afd756286083 ("md-cluster/raid10: resize all the bitmaps before start reshape")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
|
|
commit e766668c6cd49d741cfb49eaeb38998ba34d27bc upstream.
dm_stop_queue() only uses blk_mq_quiesce_queue() so it doesn't
formally stop the blk-mq queue; therefore there is no point making the
blk_mq_queue_stopped() check -- it will never be stopped.
In addition, even though dm_stop_queue() actually tries to quiesce hw
queues via blk_mq_quiesce_queue(), checking with blk_queue_quiesced()
to avoid unnecessary queue quiesce isn't reliable because: the
QUEUE_FLAG_QUIESCED flag is set before synchronize_rcu() and
dm_stop_queue() may be called when synchronize_rcu() from another
blk_mq_quiesce_queue() is in-progress.
Fixes: 7b17c2f7292ba ("dm: Fix a race condition related to stopping and starting queues")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
|
|
commit 7a1481267999c02abf4a624515c1b5c7c1fccbd6 upstream.
offset_to_stripe() returns the stripe number (in type unsigned int) from
an offset (in type uint64_t) by the following calculation,
do_div(offset, d->stripe_size);
For large capacity backing device (e.g. 18TB) with small stripe size
(e.g. 4KB), the result is 4831838208 and exceeds UINT_MAX. The actual
returned value which caller receives is 536870912, due to the overflow.
Indeed in bcache_device_init(), bcache_device->nr_stripes is limited in
range [1, INT_MAX]. Therefore all valid stripe numbers in bcache are
in range [0, bcache_dev->nr_stripes - 1].
This patch adds a upper limition check in offset_to_stripe(): the max
valid stripe number should be less than bcache_device->nr_stripes. If
the calculated stripe number from do_div() is equal to or larger than
bcache_device->nr_stripe, -EINVAL will be returned. (Normally nr_stripes
is less than INT_MAX, exceeding upper limitation doesn't mean overflow,
therefore -EOVERFLOW is not used as error code.)
This patch also changes nr_stripes' type of struct bcache_device from
'unsigned int' to 'int', and return value type of offset_to_stripe()
from 'unsigned int' to 'int', to match their exact data ranges.
All locations where bcache_device->nr_stripes and offset_to_stripe() are
referenced also get updated for the above type change.
Reported-and-tested-by: Ken Raeburn <raeburn@redhat.com>
Signed-off-by: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org
Link: https://bugzilla.redhat.com/show_bug.cgi?id=1783075
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
|
|
commit 5fe48867856367142d91a82f2cbf7a57a24cbb70 upstream.
There are some meta data of bcache are allocated by multiple pages,
and they are used as bio bv_page for I/Os to the cache device. for
example cache_set->uuids, cache->disk_buckets, journal_write->data,
bset_tree->data.
For such meta data memory, all the allocated pages should be treated
as a single memory block. Then the memory management and underlying I/O
code can treat them more clearly.
This patch adds __GFP_COMP flag to all the location allocating >0 order
pages for the above mentioned meta data. Then their pages are treated
as compound pages now.
Signed-off-by: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
|
|
commit a1c6ae3d9f3dd6aa5981a332a6f700cf1c25edef upstream.
In degraded raid5, we need to read parity to do reconstruct-write when
data disks fail. However, we can not read parity from
handle_stripe_dirtying() in force reconstruct-write mode.
Reproducible Steps:
1. Create degraded raid5
mdadm -C /dev/md2 --assume-clean -l5 -n3 /dev/sda2 /dev/sdb2 missing
2. Set rmw_level to 0
echo 0 > /sys/block/md2/md/rmw_level
3. IO to raid5
Now some io may be stuck in raid5. We can use handle_stripe_fill() to read
the parity in this situation.
Cc: <stable@vger.kernel.org> # v4.4+
Reviewed-by: Alex Wu <alexwu@synology.com>
Reviewed-by: BingJing Chang <bingjingc@synology.com>
Reviewed-by: Danny Shih <dannyshih@synology.com>
Signed-off-by: ChangSyun Peng <allenpeng@synology.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
|
|
|
|
commit 117f636ea695270fe492d0c0c9dfadc7a662af47 upstream.
In register_cache_set(), c is pointer to struct cache_set, and ca is
pointer to struct cache, if ca->sb.seq > c->sb.seq, it means this
registering cache has up to date version and other members, the in-
memory version and other members should be updated to the newer value.
But current implementation makes a cache set only has a single cache
device, so the above assumption works well except for a special case.
The execption is when a cache device new created and both ca->sb.seq and
c->sb.seq are 0, because the super block is never flushed out yet. In
the location for the following if() check,
2156 if (ca->sb.seq > c->sb.seq) {
2157 c->sb.version = ca->sb.version;
2158 memcpy(c->sb.set_uuid, ca->sb.set_uuid, 16);
2159 c->sb.flags = ca->sb.flags;
2160 c->sb.seq = ca->sb.seq;
2161 pr_debug("set version = %llu\n", c->sb.version);
2162 }
c->sb.version is not initialized yet and valued 0. When ca->sb.seq is 0,
the if() check will fail (because both values are 0), and the cache set
version, set_uuid, flags and seq won't be updated.
The above problem is hiden for current code, because the bucket size is
compatible among different super block version. And the next time when
running cache set again, ca->sb.seq will be larger than 0 and cache set
super block version will be updated properly.
But if the large bucket feature is enabled, sb->bucket_size is the low
16bits of the bucket size. For a power of 2 value, when the actual
bucket size exceeds 16bit width, sb->bucket_size will always be 0. Then
read_super_common() will fail because the if() check to
is_power_of_2(sb->bucket_size) is false. This is how the long time
hidden bug is triggered.
This patch modifies the if() check to the following way,
2156 if (ca->sb.seq > c->sb.seq || c->sb.seq == 0) {
Then cache set's version, set_uuid, flags and seq will always be updated
corectly including for a new created cache device.
Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
|
|
commit 60f80d6f2d07a6d8aee485a1d1252327eeee0c81 upstream.
reproduction steps:
```
node1 # mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sda
/dev/sdb
node2 # mdadm -A /dev/md0 /dev/sda /dev/sdb
node1 # mdadm -G /dev/md0 -b none
mdadm: failed to remove clustered bitmap.
node1 # mdadm -S --scan
^C <==== mdadm hung & kernel crash
```
kernel stack:
```
[ 335.230657] general protection fault: 0000 [#1] SMP NOPTI
[...]
[ 335.230848] Call Trace:
[ 335.230873] ? unlock_all_bitmaps+0x5/0x70 [md_cluster]
[ 335.230886] unlock_all_bitmaps+0x3d/0x70 [md_cluster]
[ 335.230899] leave+0x10f/0x190 [md_cluster]
[ 335.230932] ? md_super_wait+0x93/0xa0 [md_mod]
[ 335.230947] ? leave+0x5/0x190 [md_cluster]
[ 335.230973] md_cluster_stop+0x1a/0x30 [md_mod]
[ 335.230999] md_bitmap_free+0x142/0x150 [md_mod]
[ 335.231013] ? _cond_resched+0x15/0x40
[ 335.231025] ? mutex_lock+0xe/0x30
[ 335.231056] __md_stop+0x1c/0xa0 [md_mod]
[ 335.231083] do_md_stop+0x160/0x580 [md_mod]
[ 335.231119] ? 0xffffffffc05fb078
[ 335.231148] md_ioctl+0xa04/0x1930 [md_mod]
[ 335.231165] ? filename_lookup+0xf2/0x190
[ 335.231179] blkdev_ioctl+0x93c/0xa10
[ 335.231205] ? _cond_resched+0x15/0x40
[ 335.231214] ? __check_object_size+0xd4/0x1a0
[ 335.231224] block_ioctl+0x39/0x40
[ 335.231243] do_vfs_ioctl+0xa0/0x680
[ 335.231253] ksys_ioctl+0x70/0x80
[ 335.231261] __x64_sys_ioctl+0x16/0x20
[ 335.231271] do_syscall_64+0x65/0x1f0
[ 335.231278] entry_SYSCALL_64_after_hwframe+0x44/0xa9
```
Signed-off-by: Zhao Heming <heming.zhao@suse.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
|
|
|
|
commit 382761dc6312965a11f82f2217e16ec421bf17ae upstream.
bio_uninit is the proper API to clean up a BIO that has been allocated
on stack or inside a structure that doesn't come from the BIO allocator.
Switch dm to use that instead of bio_disassociate_blkg, which really is
an implementation detail. Note that the bio_uninit calls are also moved
to the two callers of __send_empty_flush, so that they better pair with
the bio_init calls used to initialize them.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
|
|
|
|
|
|
commit 5df96f2b9f58a5d2dc1f30fe7de75e197f2c25f2 upstream.
Commit adc0daad366b62ca1bce3e2958a40b0b71a8b8b3 ("dm: report suspended
device during destroy") broke integrity recalculation.
The problem is dm_suspended() returns true not only during suspend,
but also during resume. So this race condition could occur:
1. dm_integrity_resume calls queue_work(ic->recalc_wq, &ic->recalc_work)
2. integrity_recalc (&ic->recalc_work) preempts the current thread
3. integrity_recalc calls if (unlikely(dm_suspended(ic->ti))) goto unlock_ret;
4. integrity_recalc exits and no recalculating is done.
To fix this race condition, add a function dm_post_suspending that is
only true during the postsuspend phase and use it instead of
dm_suspended().
Signed-off-by: Mikulas Patocka <mpatocka redhat com>
Fixes: adc0daad366b ("dm: report suspended device during destroy")
Cc: stable vger kernel org # v4.18+
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
|
|
commit 6958c1c640af8c3f40fa8a2eee3b5b905d95b677 upstream.
kobject_uevent may allocate memory and it may be called while there are dm
devices suspended. The allocation may recurse into a suspended device,
causing a deadlock. We must set the noio flag when sending a uevent.
The observed deadlock was reported here:
https://www.redhat.com/archives/dm-devel/2020-March/msg00025.html
Reported-by: Khazhismel Kumykov <khazhy@google.com>
Reported-by: Tahsin Erdogan <tahsin@google.com>
Reported-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
|
|
commit 7b2377486767503d47265e4d487a63c651f6b55d upstream.
The unit of max_io_len is sector instead of byte (spotted through
code review), so fix it.
Fixes: 3b1a94c88b79 ("dm zoned: drive-managed zoned block device target")
Cc: stable@vger.kernel.org
Signed-off-by: Hou Tao <houtao1@huawei.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
|
|
|
|
commit d35bd764e6899a7bea71958f08d16cea5bfa1919 upstream.
Add cond_resched() to a loop that fills in the mapper memory area
because the loop can be executed many times.
Fixes: 48debafe4f2fe ("dm: add writecache target")
Cc: stable@vger.kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
|
|
commit 39495b12ef1cf602e6abd350dce2ef4199906531 upstream.
When uncommitted entry has been discarded, correct wc->uncommitted_block
for getting the exact number.
Fixes: 48debafe4f2fe ("dm: add writecache target")
Cc: stable@vger.kernel.org
Signed-off-by: Huaisheng Ye <yehs1@lenovo.com>
Acked-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
|
|
commit 489dc0f06a5837f87482c0ce61d830d24e17082e upstream.
The only case where dmz_get_zone_for_reclaim() cannot return a zone is
if the respective lists are empty. So we should just return a simple
NULL value here as we really don't have an error code which would make
sense.
Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
|
|
commit 2361ae595352dec015d14292f1b539242d8446d6 upstream.
SCSI LUN passthrough code such as qemu's "scsi-block" device model
pass every IO to the host via SG_IO ioctls. Currently, dm-multipath
calls choose_pgpath() only in the block IO code path, not in the ioctl
code path (unless current_pgpath is NULL). This has the effect that no
path switching and thus no load balancing is done for SCSI-passthrough
IO, unless the active path fails.
Fix this by using the same logic in multipath_prepare_ioctl() as in
multipath_clone_and_map().
Note: The allegedly best path selection algorithm, service-time,
still wouldn't work perfectly, because the io size of the current
request is always set to 0. Changing that for the IO passthrough
case would require the ioctl cmd and arg to be passed to dm's
prepare_ioctl() method.
Signed-off-by: Martin Wilck <mwilck@suse.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
|
|
Signed-off-by: Bruce Ashfield <bruce.ashfield@gmail.com>
|
|
|
|
commit 86da9f736740eba602389908574dfbb0f517baa5 upstream.
The problematic code piece in bcache_device_free() is,
785 static void bcache_device_free(struct bcache_device *d)
786 {
787 struct gendisk *disk = d->disk;
[snipped]
799 if (disk) {
800 if (disk->flags & GENHD_FL_UP)
801 del_gendisk(disk);
802
803 if (disk->queue)
804 blk_cleanup_queue(disk->queue);
805
806 ida_simple_remove(&bcache_device_idx,
807 first_minor_to_idx(disk->first_minor));
808 put_disk(disk);
809 }
[snipped]
816 }
At line 808, put_disk(disk) may encounter kobject refcount of 'disk'
being underflow.
Here is how to reproduce the issue,
- Attche the backing device to a cache device and do random write to
make the cache being dirty.
- Stop the bcache device while the cache device has dirty data of the
backing device.
- Only register the backing device back, NOT register cache device.
- The bcache device node /dev/bcache0 won't show up, because backing
device waits for the cache device shows up for the missing dirty
data.
- Now echo 1 into /sys/fs/bcache/pendings_cleanup, to stop the pending
backing device.
- After the pending backing device stopped, use 'dmesg' to check kernel
message, a use-after-free warning from KASA reported the refcount of
kobject linked to the 'disk' is underflow.
The dropping refcount at line 808 in the above code piece is added by
add_disk(d->disk) in bch_cached_dev_run(). But in the above condition
the cache device is not registered, bch_cached_dev_run() has no chance
to be called and the refcount is not added. The put_disk() for a non-
added refcount of gendisk kobject triggers a underflow warning.
This patch checks whether GENHD_FL_UP is set in disk->flags, if it is
not set then the bcache device was not added, don't call put_disk()
and the the underflow issue can be avoided.
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
|
|
commit ba54d4d4d2844c234f1b4692bd8c9e0f833c8a54 upstream.
Using GFP_NOIO flag to call scribble_alloc() from resize_chunk() does
not have the expected behavior. kvmalloc_array() inside scribble_alloc()
which receives the GFP_NOIO flag will eventually call kmalloc_node() to
allocate physically continuous pages.
Now we have memalloc scope APIs in mddev_suspend()/mddev_resume() to
prevent memory reclaim I/Os during raid array suspend context, calling
to kvmalloc_array() with GFP_KERNEL flag may avoid deadlock of recursive
I/O as expected.
This patch removes the useless gfp flags from parameters list of
scribble_alloc(), and call kvmalloc_array() with GFP_KERNEL flag. The
incorrect GFP_NOIO flag does not exist anymore.
Fixes: b330e6a49dc3 ("md: convert to kvmalloc")
Suggested-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
|
|
commit f6766ff6afff70e2aaf39e1511e16d471de7c3ae upstream.
We need to check mddev->del_work before flush workqueu since the purpose
of flush is to ensure the previous md is disappeared. Otherwise the similar
deadlock appeared if LOCKDEP is enabled, it is due to md_open holds the
bdev->bd_mutex before flush workqueue.
kernel: [ 154.522645] ======================================================
kernel: [ 154.522647] WARNING: possible circular locking dependency detected
kernel: [ 154.522650] 5.6.0-rc7-lp151.27-default #25 Tainted: G O
kernel: [ 154.522651] ------------------------------------------------------
kernel: [ 154.522653] mdadm/2482 is trying to acquire lock:
kernel: [ 154.522655] ffff888078529128 ((wq_completion)md_misc){+.+.}, at: flush_workqueue+0x84/0x4b0
kernel: [ 154.522673]
kernel: [ 154.522673] but task is already holding lock:
kernel: [ 154.522675] ffff88804efa9338 (&bdev->bd_mutex){+.+.}, at: __blkdev_get+0x79/0x590
kernel: [ 154.522691]
kernel: [ 154.522691] which lock already depends on the new lock.
kernel: [ 154.522691]
kernel: [ 154.522694]
kernel: [ 154.522694] the existing dependency chain (in reverse order) is:
kernel: [ 154.522696]
kernel: [ 154.522696] -> #4 (&bdev->bd_mutex){+.+.}:
kernel: [ 154.522704] __mutex_lock+0x87/0x950
kernel: [ 154.522706] __blkdev_get+0x79/0x590
kernel: [ 154.522708] blkdev_get+0x65/0x140
kernel: [ 154.522709] blkdev_get_by_dev+0x2f/0x40
kernel: [ 154.522716] lock_rdev+0x3d/0x90 [md_mod]
kernel: [ 154.522719] md_import_device+0xd6/0x1b0 [md_mod]
kernel: [ 154.522723] new_dev_store+0x15e/0x210 [md_mod]
kernel: [ 154.522728] md_attr_store+0x7a/0xc0 [md_mod]
kernel: [ 154.522732] kernfs_fop_write+0x117/0x1b0
kernel: [ 154.522735] vfs_write+0xad/0x1a0
kernel: [ 154.522737] ksys_write+0xa4/0xe0
kernel: [ 154.522745] do_syscall_64+0x64/0x2b0
kernel: [ 154.522748] entry_SYSCALL_64_after_hwframe+0x49/0xbe
kernel: [ 154.522749]
kernel: [ 154.522749] -> #3 (&mddev->reconfig_mutex){+.+.}:
kernel: [ 154.522752] __mutex_lock+0x87/0x950
kernel: [ 154.522756] new_dev_store+0xc9/0x210 [md_mod]
kernel: [ 154.522759] md_attr_store+0x7a/0xc0 [md_mod]
kernel: [ 154.522761] kernfs_fop_write+0x117/0x1b0
kernel: [ 154.522763] vfs_write+0xad/0x1a0
kernel: [ 154.522765] ksys_write+0xa4/0xe0
kernel: [ 154.522767] do_syscall_64+0x64/0x2b0
kernel: [ 154.522769] entry_SYSCALL_64_after_hwframe+0x49/0xbe
kernel: [ 154.522770]
kernel: [ 154.522770] -> #2 (kn->count#253){++++}:
kernel: [ 154.522775] __kernfs_remove+0x253/0x2c0
kernel: [ 154.522778] kernfs_remove+0x1f/0x30
kernel: [ 154.522780] kobject_del+0x28/0x60
kernel: [ 154.522783] mddev_delayed_delete+0x24/0x30 [md_mod]
kernel: [ 154.522786] process_one_work+0x2a7/0x5f0
kernel: [ 154.522788] worker_thread+0x2d/0x3d0
kernel: [ 154.522793] kthread+0x117/0x130
kernel: [ 154.522795] ret_from_fork+0x3a/0x50
kernel: [ 154.522796]
kernel: [ 154.522796] -> #1 ((work_completion)(&mddev->del_work)){+.+.}:
kernel: [ 154.522800] process_one_work+0x27e/0x5f0
kernel: [ 154.522802] worker_thread+0x2d/0x3d0
kernel: [ 154.522804] kthread+0x117/0x130
kernel: [ 154.522806] ret_from_fork+0x3a/0x50
kernel: [ 154.522807]
kernel: [ 154.522807] -> #0 ((wq_completion)md_misc){+.+.}:
kernel: [ 154.522813] __lock_acquire+0x1392/0x1690
kernel: [ 154.522816] lock_acquire+0xb4/0x1a0
kernel: [ 154.522818] flush_workqueue+0xab/0x4b0
kernel: [ 154.522821] md_open+0xb6/0xc0 [md_mod]
kernel: [ 154.522823] __blkdev_get+0xea/0x590
kernel: [ 154.522825] blkdev_get+0x65/0x140
kernel: [ 154.522828] do_dentry_open+0x1d1/0x380
kernel: [ 154.522831] path_openat+0x567/0xcc0
kernel: [ 154.522834] do_filp_open+0x9b/0x110
kernel: [ 154.522836] do_sys_openat2+0x201/0x2a0
kernel: [ 154.522838] do_sys_open+0x57/0x80
kernel: [ 154.522840] do_syscall_64+0x64/0x2b0
kernel: [ 154.522842] entry_SYSCALL_64_after_hwframe+0x49/0xbe
kernel: [ 154.522844]
kernel: [ 154.522844] other info that might help us debug this:
kernel: [ 154.522844]
kernel: [ 154.522846] Chain exists of:
kernel: [ 154.522846] (wq_completion)md_misc --> &mddev->reconfig_mutex --> &bdev->bd_mutex
kernel: [ 154.522846]
kernel: [ 154.522850] Possible unsafe locking scenario:
kernel: [ 154.522850]
kernel: [ 154.522852] CPU0 CPU1
kernel: [ 154.522853] ---- ----
kernel: [ 154.522854] lock(&bdev->bd_mutex);
kernel: [ 154.522856] lock(&mddev->reconfig_mutex);
kernel: [ 154.522858] lock(&bdev->bd_mutex);
kernel: [ 154.522860] lock((wq_completion)md_misc);
kernel: [ 154.522861]
kernel: [ 154.522861] *** DEADLOCK ***
kernel: [ 154.522861]
kernel: [ 154.522864] 1 lock held by mdadm/2482:
kernel: [ 154.522865] #0: ffff88804efa9338 (&bdev->bd_mutex){+.+.}, at: __blkdev_get+0x79/0x590
kernel: [ 154.522868]
kernel: [ 154.522868] stack backtrace:
kernel: [ 154.522873] CPU: 1 PID: 2482 Comm: mdadm Tainted: G O 5.6.0-rc7-lp151.27-default #25
kernel: [ 154.522875] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
kernel: [ 154.522878] Call Trace:
kernel: [ 154.522881] dump_stack+0x8f/0xcb
kernel: [ 154.522884] check_noncircular+0x194/0x1b0
kernel: [ 154.522888] ? __lock_acquire+0x1392/0x1690
kernel: [ 154.522890] __lock_acquire+0x1392/0x1690
kernel: [ 154.522893] lock_acquire+0xb4/0x1a0
kernel: [ 154.522895] ? flush_workqueue+0x84/0x4b0
kernel: [ 154.522898] flush_workqueue+0xab/0x4b0
kernel: [ 154.522900] ? flush_workqueue+0x84/0x4b0
kernel: [ 154.522905] ? md_open+0xb6/0xc0 [md_mod]
kernel: [ 154.522908] md_open+0xb6/0xc0 [md_mod]
kernel: [ 154.522910] __blkdev_get+0xea/0x590
kernel: [ 154.522912] ? bd_acquire+0xc0/0xc0
kernel: [ 154.522914] blkdev_get+0x65/0x140
kernel: [ 154.522916] ? bd_acquire+0xc0/0xc0
kernel: [ 154.522918] do_dentry_open+0x1d1/0x380
kernel: [ 154.522921] path_openat+0x567/0xcc0
kernel: [ 154.522923] ? __lock_acquire+0x380/0x1690
kernel: [ 154.522926] do_filp_open+0x9b/0x110
kernel: [ 154.522929] ? __alloc_fd+0xe5/0x1f0
kernel: [ 154.522935] ? kmem_cache_alloc+0x28c/0x630
kernel: [ 154.522939] ? do_sys_openat2+0x201/0x2a0
kernel: [ 154.522941] do_sys_openat2+0x201/0x2a0
kernel: [ 154.522944] do_sys_open+0x57/0x80
kernel: [ 154.522946] do_syscall_64+0x64/0x2b0
kernel: [ 154.522948] entry_SYSCALL_64_after_hwframe+0x49/0xbe
kernel: [ 154.522951] RIP: 0033:0x7f98d279d9ae
And md_alloc also flushed the same workqueue, but the thing is different
here. Because all the paths call md_alloc don't hold bdev->bd_mutex, and
the flush is necessary to avoid race condition, so leave it as it is.
Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
|
|
commit be23e837333a914df3f24bf0b32e87b0331ab8d1 upstream.
coccicheck reports:
drivers/md//bcache/btree.c:1538:1-7: preceding lock on line 1417
In btree_gc_coalesce func, if the coalescing process fails, we will goto
to out_nocoalesce tag directly without releasing new_nodes[i]->write_lock.
Then, it will cause a deadlock when trying to acquire new_nodes[i]->
write_lock for freeing new_nodes[i] before return.
btree_gc_coalesce func details as follows:
if alloc new_nodes[i] fails:
goto out_nocoalesce;
// obtain new_nodes[i]->write_lock
mutex_lock(&new_nodes[i]->write_lock)
// main coalescing process
for (i = nodes - 1; i > 0; --i)
[snipped]
if coalescing process fails:
// Here, directly goto out_nocoalesce
// tag will cause a deadlock
goto out_nocoalesce;
[snipped]
// release new_nodes[i]->write_lock
mutex_unlock(&new_nodes[i]->write_lock)
// coalesing succ, return
return;
out_nocoalesce:
btree_node_free(new_nodes[i]) // free new_nodes[i]
// obtain new_nodes[i]->write_lock
mutex_lock(&new_nodes[i]->write_lock);
// set flag for reuse
clear_bit(BTREE_NODE_dirty, &ew_nodes[i]->flags);
// release new_nodes[i]->write_lock
mutex_unlock(&new_nodes[i]->write_lock);
To fix the problem, we add a new tag 'out_unlock_nocoalesce' for
releasing new_nodes[i]->write_lock before out_nocoalesce tag. If
coalescing process fails, we will go to out_unlock_nocoalesce tag
for releasing new_nodes[i]->write_lock before free new_nodes[i] in
out_nocoalesce tag.
(Coly Li helps to clean up commit log format.)
Fixes: 2a285686c109816 ("bcache: btree locking rework")
Signed-off-by: Zhiqiang Liu <liuzhiqiang26@huawei.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
|
|
|
|
commit 33f2c35a54dfd75ad0e7e86918dcbe4de799a56c upstream.
Due to a bug introduced in Linux 3.14 we cannot determine the
correctly layout for a multi-zone RAID0 array - there are two
possibilities.
It is possible to tell the kernel which to chose using a module
parameter, but this can be clumsy to use. It would be best if
the choice were recorded in the metadata.
So add a feature flag for this purpose.
If it is set, then the 'layout' field of the superblock is used
to determine which layout to use.
If this flag is not set, then mddev->layout gets set to -1,
which causes the module parameter to be required.
Acked-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
|
|
|
|
commit 64611a15ca9da91ff532982429c44686f4593b5f upstream.
queue_limits::logical_block_size got changed from unsigned short to
unsigned int, but it was forgotten to update crypt_io_hints() to use the
new type. Fix it.
Fixes: ad6bf88a6c19 ("block: fix an integer overflow in logical block size")
Cc: stable@vger.kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
|
|
Signed-off-by: Bruce Ashfield <bruce.ashfield@gmail.com>
|
|
Signed-off-by: Bruce Ashfield <bruce.ashfield@gmail.com>
|
|
commit 5686dee34dbfe0238c0274e0454fa0174ac0a57a upstream.
When adding devices that don't have a scsi_dh on a BIO based multipath,
I was able to consistently hit the warning below and lock-up the system.
The problem is that __map_bio reads the flag before it potentially being
modified by choose_pgpath, and ends up using the older value.
The WARN_ON below is not trivially linked to the issue. It goes like
this: The activate_path delayed_work is not initialized for non-scsi_dh
devices, but we always set MPATHF_QUEUE_IO, asking for initialization.
That is fine, since MPATHF_QUEUE_IO would be cleared in choose_pgpath.
Nevertheless, only for BIO-based mpath, we cache the flag before calling
choose_pgpath, and use the older version when deciding if we should
initialize the path. Therefore, we end up trying to initialize the
paths, and calling the non-initialized activate_path work.
[ 82.437100] ------------[ cut here ]------------
[ 82.437659] WARNING: CPU: 3 PID: 602 at kernel/workqueue.c:1624
__queue_delayed_work+0x71/0x90
[ 82.438436] Modules linked in:
[ 82.438911] CPU: 3 PID: 602 Comm: systemd-udevd Not tainted 5.6.0-rc6+ #339
[ 82.439680] RIP: 0010:__queue_delayed_work+0x71/0x90
[ 82.440287] Code: c1 48 89 4a 50 81 ff 00 02 00 00 75 2a 4c 89 cf e9
94 d6 07 00 e9 7f e9 ff ff 0f 0b eb c7 0f 0b 48 81 7a 58 40 74 a8 94 74
a7 <0f> 0b 48 83 7a 48 00 74 a5 0f 0b eb a1 89 fe 4c 89 cf e9 c8 c4 07
[ 82.441719] RSP: 0018:ffffb738803977c0 EFLAGS: 00010007
[ 82.442121] RAX: ffffa086389f9740 RBX: 0000000000000002 RCX: 0000000000000000
[ 82.442718] RDX: ffffa086350dd930 RSI: ffffa0863d76f600 RDI: 0000000000000200
[ 82.443484] RBP: 0000000000000200 R08: 0000000000000000 R09: ffffa086350dd970
[ 82.444128] R10: 0000000000000000 R11: 0000000000000000 R12: ffffa086350dd930
[ 82.444773] R13: ffffa0863d76f600 R14: 0000000000000000 R15: ffffa08636738008
[ 82.445427] FS: 00007f6abfe9dd40(0000) GS:ffffa0863dd80000(0000) knlGS:00000
[ 82.446040] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 82.446478] CR2: 0000557d288db4e8 CR3: 0000000078b36000 CR4: 00000000000006e0
[ 82.447104] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 82.447561] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 82.448012] Call Trace:
[ 82.448164] queue_delayed_work_on+0x6d/0x80
[ 82.448472] __pg_init_all_paths+0x7b/0xf0
[ 82.448714] pg_init_all_paths+0x26/0x40
[ 82.448980] __multipath_map_bio.isra.0+0x84/0x210
[ 82.449267] __map_bio+0x3c/0x1f0
[ 82.449468] __split_and_process_non_flush+0x14a/0x1b0
[ 82.449775] __split_and_process_bio+0xde/0x340
[ 82.450045] ? dm_get_live_table+0x5/0xb0
[ 82.450278] dm_process_bio+0x98/0x290
[ 82.450518] dm_make_request+0x54/0x120
[ 82.450778] generic_make_request+0xd2/0x3e0
[ 82.451038] ? submit_bio+0x3c/0x150
[ 82.451278] submit_bio+0x3c/0x150
[ 82.451492] mpage_readpages+0x129/0x160
[ 82.451756] ? bdev_evict_inode+0x1d0/0x1d0
[ 82.452033] read_pages+0x72/0x170
[ 82.452260] __do_page_cache_readahead+0x1ba/0x1d0
[ 82.452624] force_page_cache_readahead+0x96/0x110
[ 82.452903] generic_file_read_iter+0x84f/0xae0
[ 82.453192] ? __seccomp_filter+0x7c/0x670
[ 82.453547] new_sync_read+0x10e/0x190
[ 82.453883] vfs_read+0x9d/0x150
[ 82.454172] ksys_read+0x65/0xe0
[ 82.454466] do_syscall_64+0x4e/0x210
[ 82.454828] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[...]
[ 82.462501] ---[ end trace bb39975e9cf45daa ]---
Cc: stable@vger.kernel.org
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
|
|
commit 31b22120194b5c0d460f59e0c98504de1d3f1f14 upstream.
The dm-writecache reads metadata in the target constructor. However, when
we reload the target, there could be another active instance running on
the same device. This is the sequence of operations when doing a reload:
1. construct new target
2. suspend old target
3. resume new target
4. destroy old target
Metadata that were written by the old target between steps 1 and 2 would
not be visible by the new target.
Fix the data corruption by loading the metadata in the resume handler.
Also, validate block_size is at least as large as both the devices'
logical block size and only read 1 block from the metadata during
target constructor -- no need to read entirety of metadata now that it
is done during resume.
Fixes: 48debafe4f2f ("dm: add writecache target")
Cc: stable@vger.kernel.org # v4.18+
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
|
|
commit ad4e80a639fc61d5ecebb03caa5cdbfb91fcebfc upstream.
The error correction data is computed as if data and hash blocks
were concatenated. But hash block number starts from v->hash_start.
So, we have to calculate hash block number based on that.
Fixes: a739ff3f543af ("dm verity: add support for forward error correction")
Cc: stable@vger.kernel.org
Signed-off-by: Sunwook Eom <speed.eom@samsung.com>
Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
|
|
commit b8fdd090376a7a46d17db316638fe54b965c2fb0 upstream.
zmd->nr_rnd_zones was increased twice by mistake. The other place it
is increased in dmz_init_zone() is the only one needed:
1131 zmd->nr_useable_zones++;
1132 if (dmz_is_rnd(zone)) {
1133 zmd->nr_rnd_zones++;
^^^
Fixes: 3b1a94c88b79 ("dm zoned: drive-managed zoned block device target")
Cc: stable@vger.kernel.org
Signed-off-by: Bob Liu <bob.liu@oracle.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
|
|
commit 75fa601934fda23d2f15bf44b09c2401942d8e15 upstream.
Fix below kmemleak detected in verity_fec_ctr. output_pool is
allocated for each dm-verity-fec device. But it is not freed when
dm-table for the verity target is removed. Hence free the output
mempool in destructor function verity_fec_dtr.
unreferenced object 0xffffffffa574d000 (size 4096):
comm "init", pid 1667, jiffies 4294894890 (age 307.168s)
hex dump (first 32 bytes):
8e 36 00 98 66 a8 0b 9b 00 00 00 00 00 00 00 00 .6..f...........
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace:
[<0000000060e82407>] __kmalloc+0x2b4/0x340
[<00000000dd99488f>] mempool_kmalloc+0x18/0x20
[<000000002560172b>] mempool_init_node+0x98/0x118
[<000000006c3574d2>] mempool_init+0x14/0x20
[<0000000008cb266e>] verity_fec_ctr+0x388/0x3b0
[<000000000887261b>] verity_ctr+0x87c/0x8d0
[<000000002b1e1c62>] dm_table_add_target+0x174/0x348
[<000000002ad89eda>] table_load+0xe4/0x328
[<000000001f06f5e9>] dm_ctl_ioctl+0x3b4/0x5a0
[<00000000bee5fbb7>] do_vfs_ioctl+0x5dc/0x928
[<00000000b475b8f5>] __arm64_sys_ioctl+0x70/0x98
[<000000005361e2e8>] el0_svc_common+0xa0/0x158
[<000000001374818f>] el0_svc_handler+0x6c/0x88
[<000000003364e9f4>] el0_svc+0x8/0xc
[<000000009d84cec9>] 0xffffffffffffffff
Fixes: a739ff3f543af ("dm verity: add support for forward error correction")
Depends-on: 6f1c819c219f7 ("dm: convert to bioset_init()/mempool_init()")
Cc: stable@vger.kernel.org
Signed-off-by: Harshini Shetty <harshini.x.shetty@sony.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
|
|
commit b93b6643e9b5a7f260b931e97f56ffa3fa65e26d upstream.
If the user specifies tag size larger than HASH_MAX_DIGESTSIZE,
there's a crash in integrity_metadata().
Cc: stable@vger.kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
|
|
commit 1edaa447d958bec24c6a79685a5790d98976fd16 upstream.
Initializing a dm-writecache device can take a long time when the
persistent memory device is large. Add cond_resched() to a few loops
to avoid warnings that the CPU is stuck.
Cc: stable@vger.kernel.org # v4.18+
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
|
|
commit 6b40bec3b13278d21fa6c1ae7a0bdf2e550eed5f upstream.
Don't call quiesce(1) and quiesce(0) if array is already suspended,
otherwise in level_store, the array is writable after mddev_detach
in below part though the intention is to make array writable after
resume.
mddev_suspend(mddev);
mddev_detach(mddev);
...
mddev_resume(mddev);
And it also causes calltrace as follows in [1].
[48005.653834] WARNING: CPU: 1 PID: 45380 at kernel/kthread.c:510 kthread_park+0x77/0x90
[...]
[48005.653976] CPU: 1 PID: 45380 Comm: mdadm Tainted: G OE 5.4.10-arch1-1 #1
[48005.653979] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./J4105-ITX, BIOS P1.40 08/06/2018
[48005.653984] RIP: 0010:kthread_park+0x77/0x90
[48005.654015] Call Trace:
[48005.654039] r5l_quiesce+0x3c/0x70 [raid456]
[48005.654052] raid5_quiesce+0x228/0x2e0 [raid456]
[48005.654073] mddev_detach+0x30/0x70 [md_mod]
[48005.654090] level_store+0x202/0x670 [md_mod]
[48005.654099] ? security_capable+0x40/0x60
[48005.654114] md_attr_store+0x7b/0xc0 [md_mod]
[48005.654123] kernfs_fop_write+0xce/0x1b0
[48005.654132] vfs_write+0xb6/0x1a0
[48005.654138] ksys_write+0x67/0xe0
[48005.654146] do_syscall_64+0x4e/0x140
[48005.654155] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[48005.654161] RIP: 0033:0x7fa0c8737497
[1]: https://bugzilla.kernel.org/show_bug.cgi?id=206161
Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
|
|
commit 120c9257f5f19e5d1e87efcbb5531b7cd81b7d74 upstream.
This reverts commit effd58c95f277744f75d6e08819ac859dbcbd351.
blk_queue_split() is causing excessive IO splitting -- because
blk_max_size_offset() depends on 'chunk_sectors' limit being set and
if it isn't (as is the case for DM targets!) it falls back to
splitting on a 'max_sectors' boundary regardless of offset.
"Fix" this by reverting back to _not_ using blk_queue_split() in
dm_process_bio() for normal IO (reads and writes). Long-term fix is
still TBD but it should focus on training blk_max_size_offset() to
call into a DM provided hook (to call DM's max_io_len()).
Test results from simple misaligned IO test on 4-way dm-striped device
with chunksize of 128K and stripesize of 512K:
xfs_io -d -c 'pread -b 2m 224s 4072s' /dev/mapper/stripe_dev
before this revert:
253,0 21 1 0.000000000 2206 Q R 224 + 4072 [xfs_io]
253,0 21 2 0.000008267 2206 X R 224 / 480 [xfs_io]
253,0 21 3 0.000010530 2206 X R 224 / 256 [xfs_io]
253,0 21 4 0.000027022 2206 X R 480 / 736 [xfs_io]
253,0 21 5 0.000028751 2206 X R 480 / 512 [xfs_io]
253,0 21 6 0.000033323 2206 X R 736 / 992 [xfs_io]
253,0 21 7 0.000035130 2206 X R 736 / 768 [xfs_io]
253,0 21 8 0.000039146 2206 X R 992 / 1248 [xfs_io]
253,0 21 9 0.000040734 2206 X R 992 / 1024 [xfs_io]
253,0 21 10 0.000044694 2206 X R 1248 / 1504 [xfs_io]
253,0 21 11 0.000046422 2206 X R 1248 / 1280 [xfs_io]
253,0 21 12 0.000050376 2206 X R 1504 / 1760 [xfs_io]
253,0 21 13 0.000051974 2206 X R 1504 / 1536 [xfs_io]
253,0 21 14 0.000055881 2206 X R 1760 / 2016 [xfs_io]
253,0 21 15 0.000057462 2206 X R 1760 / 1792 [xfs_io]
253,0 21 16 0.000060999 2206 X R 2016 / 2272 [xfs_io]
253,0 21 17 0.000062489 2206 X R 2016 / 2048 [xfs_io]
253,0 21 18 0.000066133 2206 X R 2272 / 2528 [xfs_io]
253,0 21 19 0.000067507 2206 X R 2272 / 2304 [xfs_io]
253,0 21 20 0.000071136 2206 X R 2528 / 2784 [xfs_io]
253,0 21 21 0.000072764 2206 X R 2528 / 2560 [xfs_io]
253,0 21 22 0.000076185 2206 X R 2784 / 3040 [xfs_io]
253,0 21 23 0.000077486 2206 X R 2784 / 2816 [xfs_io]
253,0 21 24 0.000080885 2206 X R 3040 / 3296 [xfs_io]
253,0 21 25 0.000082316 2206 X R 3040 / 3072 [xfs_io]
253,0 21 26 0.000085788 2206 X R 3296 / 3552 [xfs_io]
253,0 21 27 0.000087096 2206 X R 3296 / 3328 [xfs_io]
253,0 21 28 0.000093469 2206 X R 3552 / 3808 [xfs_io]
253,0 21 29 0.000095186 2206 X R 3552 / 3584 [xfs_io]
253,0 21 30 0.000099228 2206 X R 3808 / 4064 [xfs_io]
253,0 21 31 0.000101062 2206 X R 3808 / 3840 [xfs_io]
253,0 21 32 0.000104956 2206 X R 4064 / 4096 [xfs_io]
253,0 21 33 0.001138823 0 C R 4096 + 200 [0]
after this revert:
253,0 18 1 0.000000000 4430 Q R 224 + 3896 [xfs_io]
253,0 18 2 0.000018359 4430 X R 224 / 256 [xfs_io]
253,0 18 3 0.000028898 4430 X R 256 / 512 [xfs_io]
253,0 18 4 0.000033535 4430 X R 512 / 768 [xfs_io]
253,0 18 5 0.000065684 4430 X R 768 / 1024 [xfs_io]
253,0 18 6 0.000091695 4430 X R 1024 / 1280 [xfs_io]
253,0 18 7 0.000098494 4430 X R 1280 / 1536 [xfs_io]
253,0 18 8 0.000114069 4430 X R 1536 / 1792 [xfs_io]
253,0 18 9 0.000129483 4430 X R 1792 / 2048 [xfs_io]
253,0 18 10 0.000136759 4430 X R 2048 / 2304 [xfs_io]
253,0 18 11 0.000152412 4430 X R 2304 / 2560 [xfs_io]
253,0 18 12 0.000160758 4430 X R 2560 / 2816 [xfs_io]
253,0 18 13 0.000183385 4430 X R 2816 / 3072 [xfs_io]
253,0 18 14 0.000190797 4430 X R 3072 / 3328 [xfs_io]
253,0 18 15 0.000197667 4430 X R 3328 / 3584 [xfs_io]
253,0 18 16 0.000218751 4430 X R 3584 / 3840 [xfs_io]
253,0 18 17 0.000226005 4430 X R 3840 / 4096 [xfs_io]
253,0 18 18 0.000250404 4430 Q R 4120 + 176 [xfs_io]
253,0 18 19 0.000847708 0 C R 4096 + 24 [0]
253,0 18 20 0.000855783 0 C R 4120 + 176 [0]
Fixes: effd58c95f27774 ("dm: always call blk_queue_split() in dm_process_bio()")
Cc: stable@vger.kernel.org
Reported-by: Andreas Gruenbacher <agruenba@redhat.com>
Tested-by: Barry Marson <bmarson@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
|
|
Signed-off-by: Bruce Ashfield <bruce.ashfield@gmail.com>
|
|
commit 248aa2645aa7fc9175d1107c2593cc90d4af5a4e upstream.
In cases where dec_in_flight() has to requeue the integrity_bio_wait
work to transfer the rest of the data, the bio's __bi_remaining might
already have been decremented to 0, e.g.: if bio passed to underlying
data device was split via blk_queue_split().
Use dm_bio_{record,restore} rather than effectively open-coding them in
dm-integrity -- these methods now manage __bi_remaining too.
Depends-on: f7f0b057a9c1 ("dm bio record: save/restore bi_end_io and bi_integrity")
Reported-by: Daniel Glöckner <dg@emlix.com>
Suggested-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
|
|
commit 1b17159e52bb31f982f82a6278acd7fab1d3f67b upstream.
Also, save/restore __bi_remaining in case the bio was used in a
BIO_CHAIN (e.g. due to blk_queue_split).
Suggested-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
|
|
commit 974f51e8633f0f3f33e8f86bbb5ae66758aa63c7 upstream.
We neither assign congested_fn for requested-based blk-mq device nor
implement it correctly. So fix both.
Also, remove incorrect comment from dm_init_normal_md_queue and rename
it to dm_init_congested_fn.
Fixes: 4aa9c692e052 ("bdi: separate out congested state into a separate struct")
Cc: stable@vger.kernel.org
Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
|
|
commit 41c526c5af46d4c4dab7f72c99000b7fac0b9702 upstream.
Verify the watermark upon resume - so that if the target is reloaded
with lower watermark, it will start the cleanup process immediately.
Fixes: 48debafe4f2f ("dm: add writecache target")
Cc: stable@vger.kernel.org # 4.18+
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
|
|
commit adc0daad366b62ca1bce3e2958a40b0b71a8b8b3 upstream.
The function dm_suspended returns true if the target is suspended.
However, when the target is being suspended during unload, it returns
false.
An example where this is a problem: the test "!dm_suspended(wc->ti)" in
writecache_writeback is not sufficient, because dm_suspended returns
zero while writecache_suspend is in progress. As is, without an
enhanced dm_suspended, simply switching from flush_workqueue to
drain_workqueue still emits warnings:
workqueue writecache-writeback: drain_workqueue() isn't complete after 10 tries
workqueue writecache-writeback: drain_workqueue() isn't complete after 100 tries
workqueue writecache-writeback: drain_workqueue() isn't complete after 200 tries
workqueue writecache-writeback: drain_workqueue() isn't complete after 300 tries
workqueue writecache-writeback: drain_workqueue() isn't complete after 400 tries
writecache_suspend calls flush_workqueue(wc->writeback_wq) - this function
flushes the current work. However, the workqueue may re-queue itself and
flush_workqueue doesn't wait for re-queued works to finish. Because of
this - the function writecache_writeback continues execution after the
device was suspended and then concurrently with writecache_dtr, causing
a crash in writecache_writeback.
We must use drain_workqueue - that waits until the work and all re-queued
works finish.
As a prereq for switching to drain_workqueue, this commit fixes
dm_suspended to return true after the presuspend hook and before the
postsuspend hook - just like during a normal suspend. It allows
simplifying the dm-integrity and dm-writecache targets so that they
don't have to maintain suspended flags on their own.
With this change use of drain_workqueue() can be used effectively. This
change was tested with the lvm2 testsuite and cryptsetup testsuite and
the are no regressions.
Fixes: 48debafe4f2f ("dm: add writecache target")
Cc: stable@vger.kernel.org # 4.18+
Reported-by: Corey Marthaler <cmarthal@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
|
|
commit 7cdf6a0aae1cccf5167f3f04ecddcf648b78e289 upstream.
The crash can be reproduced by running the lvm2 testsuite test
lvconvert-thin-external-cache.sh for several minutes, e.g.:
while :; do make check T=shell/lvconvert-thin-external-cache.sh; done
The crash happens in this call chain:
do_waker -> policy_tick -> smq_tick -> end_hotspot_period -> clear_bitset
-> memset -> __memset -- which accesses an invalid pointer in the vmalloc
area.
The work entry on the workqueue is executed even after the bitmap was
freed. The problem is that cancel_delayed_work doesn't wait for the
running work item to finish, so the work item can continue running and
re-submitting itself even after cache_postsuspend. In order to make sure
that the work item won't be running, we must use cancel_delayed_work_sync.
Also, change flush_workqueue to drain_workqueue, so that if some work item
submits itself or another work item, we are properly waiting for both of
them.
Fixes: c6b4fcbad044 ("dm: add cache target")
Cc: stable@vger.kernel.org # v3.9
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
|
|
commit 7fc2e47f40dd77ab1fcbda6db89614a0173d89c7 upstream.
If the flag SB_FLAG_RECALCULATE is present in the superblock, but it was
not specified on the command line (i.e. ic->recalculate_flag is false),
dm-integrity would return invalid table line - the reported number of
arguments would not match the real number.
Fixes: 468dfca38b1a ("dm integrity: add a bitmap mode")
Cc: stable@vger.kernel.org # v5.2+
Reported-by: Ondrej Kozina <okozina@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
|