summaryrefslogtreecommitdiffstats
path: root/drivers/md
AgeCommit message (Collapse)Author
2018-05-30bcache: quit dc->writeback_thread when BCACHE_DEV_DETACHING is setColy Li
[ Upstream commit fadd94e05c02afec7b70b0b14915624f1782f578 ] In patch "bcache: fix cached_dev->count usage for bch_cache_set_error()", cached_dev_get() is called when creating dc->writeback_thread, and cached_dev_put() is called when exiting dc->writeback_thread. This modification works well unless people detach the bcache device manually by 'echo 1 > /sys/block/bcache<N>/bcache/detach' Because this sysfs interface only calls bch_cached_dev_detach() which wakes up dc->writeback_thread but does not stop it. The reason is, before patch "bcache: fix cached_dev->count usage for bch_cache_set_error()", inside bch_writeback_thread(), if cache is not dirty after writeback, cached_dev_put() will be called here. And in cached_dev_make_request() when a new write request makes cache from clean to dirty, cached_dev_get() will be called there. Since we don't operate dc->count in these locations, refcount d->count cannot be dropped after cache becomes clean, and cached_dev_detach_finish() won't be called to detach bcache device. This patch fixes the issue by checking whether BCACHE_DEV_DETACHING is set inside bch_writeback_thread(). If this bit is set and cache is clean (no existing writeback_keys), break the while-loop, call cached_dev_put() and quit the writeback thread. Please note if cache is still dirty, even BCACHE_DEV_DETACHING is set the writeback thread should continue to perform writeback, this is the original design of manually detach. It is safe to do the following check without locking, let me explain why, + if (!test_bit(BCACHE_DEV_DETACHING, &dc->disk.flags) && + (!atomic_read(&dc->has_dirty) || !dc->writeback_running)) { If the kenrel thread does not sleep and continue to run due to conditions are not updated in time on the running CPU core, it just consumes more CPU cycles and has no hurt. This should-sleep-but-run is safe here. We just focus on the should-run-but-sleep condition, which means the writeback thread goes to sleep in mistake while it should continue to run. 1, First of all, no matter the writeback thread is hung or not, kthread_stop() from cached_dev_detach_finish() will wake up it and terminate by making kthread_should_stop() return true. And in normal run time, bit on index BCACHE_DEV_DETACHING is always cleared, the condition !test_bit(BCACHE_DEV_DETACHING, &dc->disk.flags) is always true and can be ignored as constant value. 2, If one of the following conditions is true, the writeback thread should go to sleep, "!atomic_read(&dc->has_dirty)" or "!dc->writeback_running)" each of them independently controls the writeback thread should sleep or not, let's analyse them one by one. 2.1 condition "!atomic_read(&dc->has_dirty)" If dc->has_dirty is set from 0 to 1 on another CPU core, bcache will call bch_writeback_queue() immediately or call bch_writeback_add() which indirectly calls bch_writeback_queue() too. In bch_writeback_queue(), wake_up_process(dc->writeback_thread) is called. It sets writeback thread's task state to TASK_RUNNING and following an implicit memory barrier, then tries to wake up the writeback thread. In writeback thread, its task state is set to TASK_INTERRUPTIBLE before doing the condition check. If other CPU core sets the TASK_RUNNING state after writeback thread setting TASK_INTERRUPTIBLE, the writeback thread will be scheduled to run very soon because its state is not TASK_INTERRUPTIBLE. If other CPU core sets the TASK_RUNNING state before writeback thread setting TASK_INTERRUPTIBLE, the implict memory barrier of wake_up_process() will make sure modification of dc->has_dirty on other CPU core is updated and observed on the CPU core of writeback thread. Therefore the condition check will correctly be false, and continue writeback code without sleeping. 2.2 condition "!dc->writeback_running)" dc->writeback_running can be changed via sysfs file, every time it is modified, a following bch_writeback_queue() is alwasy called. So the change is always observed on the CPU core of writeback thread. If dc->writeback_running is changed from 0 to 1 on other CPU core, this condition check will observe the modification and allow writeback thread to continue to run without sleeping. Now we can see, even without a locking protection, multiple conditions check is safe here, no deadlock or process hang up will happen. I compose a separte patch because that patch "bcache: fix cached_dev->count usage for bch_cache_set_error()" already gets a "Reviewed-by:" from Hannes Reinecke. Also this fix is not trivial and good for a separate patch. Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Michael Lyle <mlyle@lyle.org> Cc: Hannes Reinecke <hare@suse.com> Cc: Huijun Tang <tang.junhui@zte.com.cn> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-30bcache: fix kcrashes with fio in RAID5 backend devTang Junhui
[ Upstream commit 60eb34ec5526e264c2bbaea4f7512d714d791caf ] Kernel crashed when run fio in a RAID5 backend bcache device, the call trace is bellow: [ 440.012034] kernel BUG at block/blk-ioc.c:146! [ 440.012696] invalid opcode: 0000 [#1] SMP NOPTI [ 440.026537] CPU: 2 PID: 2205 Comm: md127_raid5 Not tainted 4.15.0 #8 [ 440.027441] Hardware name: HP ProLiant MicroServer Gen8, BIOS J06 07/16 /2015 [ 440.028615] RIP: 0010:put_io_context+0x8b/0x90 [ 440.029246] RSP: 0018:ffffa8c882b43af8 EFLAGS: 00010246 [ 440.029990] RAX: 0000000000000000 RBX: ffffa8c88294fca0 RCX: 0000000000 0f4240 [ 440.031006] RDX: 0000000000000004 RSI: 0000000000000286 RDI: ffffa8c882 94fca0 [ 440.032030] RBP: ffffa8c882b43b10 R08: 0000000000000003 R09: ffff949cb8 0c1700 [ 440.033206] R10: 0000000000000104 R11: 000000000000b71c R12: 00000000000 01000 [ 440.034222] R13: 0000000000000000 R14: ffff949cad84db70 R15: ffff949cb11 bd1e0 [ 440.035239] FS: 0000000000000000(0000) GS:ffff949cba280000(0000) knlGS: 0000000000000000 [ 440.060190] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 440.084967] CR2: 00007ff0493ef000 CR3: 00000002f1e0a002 CR4: 00000000001 606e0 [ 440.110498] Call Trace: [ 440.135443] bio_disassociate_task+0x1b/0x60 [ 440.160355] bio_free+0x1b/0x60 [ 440.184666] bio_put+0x23/0x30 [ 440.208272] search_free+0x23/0x40 [bcache] [ 440.231448] cached_dev_write_complete+0x31/0x70 [bcache] [ 440.254468] closure_put+0xb6/0xd0 [bcache] [ 440.277087] request_endio+0x30/0x40 [bcache] [ 440.298703] bio_endio+0xa1/0x120 [ 440.319644] handle_stripe+0x418/0x2270 [raid456] [ 440.340614] ? load_balance+0x17b/0x9c0 [ 440.360506] handle_active_stripes.isra.58+0x387/0x5a0 [raid456] [ 440.380675] ? __release_stripe+0x15/0x20 [raid456] [ 440.400132] raid5d+0x3ed/0x5d0 [raid456] [ 440.419193] ? schedule+0x36/0x80 [ 440.437932] ? schedule_timeout+0x1d2/0x2f0 [ 440.456136] md_thread+0x122/0x150 [ 440.473687] ? wait_woken+0x80/0x80 [ 440.491411] kthread+0x102/0x140 [ 440.508636] ? find_pers+0x70/0x70 [ 440.524927] ? kthread_associate_blkcg+0xa0/0xa0 [ 440.541791] ret_from_fork+0x35/0x40 [ 440.558020] Code: c2 48 00 5b 41 5c 41 5d 5d c3 48 89 c6 4c 89 e7 e8 bb c2 48 00 48 8b 3d bc 36 4b 01 48 89 de e8 7c f7 e0 ff 5b 41 5c 41 5d 5d c3 <0f> 0b 0f 1f 00 0f 1f 44 00 00 55 48 8d 47 b8 48 89 e5 41 57 41 [ 440.610020] RIP: put_io_context+0x8b/0x90 RSP: ffffa8c882b43af8 [ 440.628575] ---[ end trace a1fd79d85643a73e ]-- All the crash issue happened when a bypass IO coming, in such scenario s->iop.bio is pointed to the s->orig_bio. In search_free(), it finishes the s->orig_bio by calling bio_complete(), and after that, s->iop.bio became invalid, then kernel would crash when calling bio_put(). Maybe its upper layer's faulty, since bio should not be freed before we calling bio_put(), but we'd better calling bio_put() first before calling bio_complete() to notify upper layer ending this bio. This patch moves bio_complete() under bio_put() to avoid kernel crash. [mlyle: fixed commit subject for character limits] Reported-by: Matthias Ferdinand <bcache@mfedv.net> Tested-by: Matthias Ferdinand <bcache@mfedv.net> Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn> Reviewed-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-30md/raid1: fix NULL pointer dereferenceYufen Yu
[ Upstream commit 3de59bb9d551428cbdc76a9ea57883f82e350b4d ] In handle_write_finished(), if r1_bio->bios[m] != NULL, it thinks the corresponding conf->mirrors[m].rdev is also not NULL. But, it is not always true. Even if some io hold replacement rdev(i.e. rdev->nr_pending.count > 0), raid1_remove_disk() can also set the rdev as NULL. That means, bios[m] != NULL, but mirrors[m].rdev is NULL, resulting in NULL pointer dereference in handle_write_finished and sync_request_write. This patch can fix BUGs as follows: BUG: unable to handle kernel NULL pointer dereference at 0000000000000140 IP: [<ffffffff815bbbbd>] raid1d+0x2bd/0xfc0 PGD 12ab52067 PUD 12f587067 PMD 0 Oops: 0000 [#1] SMP CPU: 1 PID: 2008 Comm: md3_raid1 Not tainted 4.1.44+ #130 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1.fc26 04/01/2014 Call Trace: ? schedule+0x37/0x90 ? prepare_to_wait_event+0x83/0xf0 md_thread+0x144/0x150 ? wake_atomic_t_function+0x70/0x70 ? md_start_sync+0xf0/0xf0 kthread+0xd8/0xf0 ? kthread_worker_fn+0x160/0x160 ret_from_fork+0x42/0x70 ? kthread_worker_fn+0x160/0x160 BUG: unable to handle kernel NULL pointer dereference at 00000000000000b8 IP: sync_request_write+0x9e/0x980 PGD 800000007c518067 P4D 800000007c518067 PUD 8002b067 PMD 0 Oops: 0000 [#1] SMP PTI CPU: 24 PID: 2549 Comm: md3_raid1 Not tainted 4.15.0+ #118 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1.fc26 04/01/2014 Call Trace: ? sched_clock+0x5/0x10 ? sched_clock_cpu+0xc/0xb0 ? flush_pending_writes+0x3a/0xd0 ? pick_next_task_fair+0x4d5/0x5f0 ? __switch_to+0xa2/0x430 raid1d+0x65a/0x870 ? find_pers+0x70/0x70 ? find_pers+0x70/0x70 ? md_thread+0x11c/0x160 md_thread+0x11c/0x160 ? finish_wait+0x80/0x80 kthread+0x111/0x130 ? kthread_create_worker_on_cpu+0x70/0x70 ? do_syscall_64+0x6f/0x190 ? SyS_exit_group+0x10/0x10 ret_from_fork+0x35/0x40 Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Yufen Yu <yuyufen@huawei.com> Signed-off-by: Shaohua Li <sh.li@alibaba-inc.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-30md: fix a potential deadlock of raid5/raid10 reshapeBingJing Chang
[ Upstream commit 8876391e440ba615b10eef729576e111f0315f87 ] There is a potential deadlock if mount/umount happens when raid5_finish_reshape() tries to grow the size of emulated disk. How the deadlock happens? 1) The raid5 resync thread finished reshape (expanding array). 2) The mount or umount thread holds VFS sb->s_umount lock and tries to write through critical data into raid5 emulated block device. So it waits for raid5 kernel thread handling stripes in order to finish it I/Os. 3) In the routine of raid5 kernel thread, md_check_recovery() will be called first in order to reap the raid5 resync thread. That is, raid5_finish_reshape() will be called. In this function, it will try to update conf and call VFS revalidate_disk() to grow the raid5 emulated block device. It will try to acquire VFS sb->s_umount lock. The raid5 kernel thread cannot continue, so no one can handle mount/ umount I/Os (stripes). Once the write-through I/Os cannot be finished, mount/umount will not release sb->s_umount lock. The deadlock happens. The raid5 kernel thread is an emulated block device. It is responible to handle I/Os (stripes) from upper layers. The emulated block device should not request any I/Os on itself. That is, it should not call VFS layer functions. (If it did, it will try to acquire VFS locks to guarantee the I/Os sequence.) So we have the resync thread to send resync I/O requests and to wait for the results. For solving this potential deadlock, we can put the size growth of the emulated block device as the final step of reshape thread. 2017/12/29: Thanks to Guoqing Jiang <gqjiang@suse.com>, we confirmed that there is the same deadlock issue in raid10. It's reproducible and can be fixed by this patch. For raid10.c, we can remove the similar code to prevent deadlock as well since they has been called before. Reported-by: Alex Wu <alexwu@synology.com> Reviewed-by: Alex Wu <alexwu@synology.com> Reviewed-by: Chung-Chiang Cheng <cccheng@synology.com> Signed-off-by: BingJing Chang <bingjingc@synology.com> Signed-off-by: Shaohua Li <sh.li@alibaba-inc.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-30md: raid5: avoid string overflow warningArnd Bergmann
[ Upstream commit 53b8d89ddbdbb0e4625a46d2cdbb6f106c52f801 ] gcc warns about a possible overflow of the kmem_cache string, when adding four characters to a string of the same length: drivers/md/raid5.c: In function 'setup_conf': drivers/md/raid5.c:2207:34: error: '-alt' directive writing 4 bytes into a region of size between 1 and 32 [-Werror=format-overflow=] sprintf(conf->cache_name[1], "%s-alt", conf->cache_name[0]); ^~~~ drivers/md/raid5.c:2207:2: note: 'sprintf' output between 5 and 36 bytes into a destination of size 32 sprintf(conf->cache_name[1], "%s-alt", conf->cache_name[0]); ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ If I'm counting correctly, we need 11 characters for the fixed part of the string and 18 characters for a 64-bit pointer (when no gendisk is used), so that leaves three characters for conf->level, which should always be sufficient. This makes the code use snprintf() with the correct length, to make the code more robust against changes, and to get the compiler to shut up. In commit f4be6b43f1ac ("md/raid5: ensure we create a unique name for kmem_cache when mddev has no gendisk") from 2010, Neil said that the pointer could be removed "shortly" once devices without gendisk are disallowed. I have no idea if that happened, but if it did, that should probably be changed as well. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Shaohua Li <sh.li@alibaba-inc.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-30md raid10: fix NULL deference in handle_write_completed()Yufen Yu
[ Upstream commit 01a69cab01c184d3786af09e9339311123d63d22 ] In the case of 'recover', an r10bio with R10BIO_WriteError & R10BIO_IsRecover will be progressed by handle_write_completed(). This function traverses all r10bio->devs[copies]. If devs[m].repl_bio != NULL, it thinks conf->mirrors[dev].replacement is also not NULL. However, this is not always true. When there is an rdev of raid10 has replacement, then each r10bio ->devs[m].repl_bio != NULL in conf->r10buf_pool. However, in 'recover', even if corresponded replacement is NULL, it doesn't clear r10bio ->devs[m].repl_bio, resulting in replacement NULL deference. This bug was introduced when replacement support for raid10 was added in Linux 3.3. As NeilBrown suggested: Elsewhere the determination of "is this device part of the resync/recovery" is made by resting bio->bi_end_io. If this is end_sync_write, then we tried to write here. If it is NULL, then we didn't try to write. Fixes: 9ad1aefc8ae8 ("md/raid10: Handle replacement devices during resync.") Cc: stable (V3.3+) Suggested-by: NeilBrown <neilb@suse.com> Signed-off-by: Yufen Yu <yuyufen@huawei.com> Signed-off-by: Shaohua Li <sh.li@alibaba-inc.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-30md: fix md_write_start() deadlock w/o metadata devicesHeinz Mauelshagen
[ Upstream commit 4b6c1060eaa6495aa5b0032e8f2d51dd936b1257 ] If no metadata devices are configured on raid1/4/5/6/10 (e.g. via dm-raid), md_write_start() unconditionally waits for superblocks to be written thus deadlocking. Fix introduces mddev->has_superblocks bool, defines it in md_run() and checks for it in md_write_start() to conditionally avoid waiting. Once on it, check for non-existing superblocks in md_super_write(). Link: https://bugzilla.kernel.org/show_bug.cgi?id=198647 Fixes: cc27b0c78c796 ("md: fix deadlock between mddev_suspend() and md_write_start()") Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Shaohua Li <sh.li@alibaba-inc.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-30MD: Free bioset when md_run failsXiao Ni
[ Upstream commit b126194cbb799f9980b92a77e58db6ad794c8082 ] Signed-off-by: Xiao Ni <xni@redhat.com> Acked-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <sh.li@alibaba-inc.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-16dm integrity: use kvfree for kvmalloc'd memoryMikulas Patocka
commit fc8cec113904a47396bf0a1afc62920d66319d36 upstream. Use kvfree instead of kfree because the array is allocated with kvmalloc. Fixes: 7eada909bfd7a ("dm: add integrity target") Cc: stable@vger.kernel.org # v4.12+ Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-04-26bcache: return attach error when no cache set existTang Junhui
[ Upstream commit 7f4fc93d4713394ee8f1cd44c238e046e11b4f15 ] I attach a back-end device to a cache set, and the cache set is not registered yet, this back-end device did not attach successfully, and no error returned: [root]# echo 87859280-fec6-4bcc-20df7ca8f86b > /sys/block/sde/bcache/attach [root]# In sysfs_attach(), the return value "v" is initialized to "size" in the beginning, and if no cache set exist in bch_cache_sets, the "v" value would not change any more, and return to sysfs, sysfs regard it as success since the "size" is a positive number. This patch fixes this issue by assigning "v" with "-ENOENT" in the initialization. Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn> Reviewed-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-04-26bcache: fix for data collapse after re-attaching an attached deviceTang Junhui
[ Upstream commit 73ac105be390c1de42a2f21643c9778a5e002930 ] back-end device sdm has already attached a cache_set with ID f67ebe1f-f8bc-4d73-bfe5-9dc88607f119, then try to attach with another cache set, and it returns with an error: [root]# cd /sys/block/sdm/bcache [root]# echo 5ccd0a63-148e-48b8-afa2-aca9cbd6279f > attach -bash: echo: write error: Invalid argument After that, execute a command to modify the label of bcache device: [root]# echo data_disk1 > label Then we reboot the system, when the system power on, the back-end device can not attach to cache_set, a messages show in the log: Feb 5 12:05:52 ceph152 kernel: [922385.508498] bcache: bch_cached_dev_attach() couldn't find uuid for sdm in set In sysfs_attach(), dc->sb.set_uuid was assigned to the value which input through sysfs, no matter whether it is success or not in bch_cached_dev_attach(). For example, If the back-end device has already attached to an cache set, bch_cached_dev_attach() would fail, but dc->sb.set_uuid was changed. Then modify the label of bcache device, it will call bch_write_bdev_super(), which would write the dc->sb.set_uuid to the super block, so we record a wrong cache set ID in the super block, after the system reboot, the cache set couldn't find the uuid of the back-end device, so the bcache device couldn't exist and use any more. In this patch, we don't assigned cache set ID to dc->sb.set_uuid in sysfs_attach() directly, but input it into bch_cached_dev_attach(), and assigned dc->sb.set_uuid to the cache set ID after the back-end device attached to the cache set successful. Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn> Reviewed-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-04-26bcache: fix for allocator and register thread raceTang Junhui
[ Upstream commit 682811b3ce1a5a4e20d700939a9042f01dbc66c4 ] After long time running of random small IO writing, I reboot the machine, and after the machine power on, I found bcache got stuck, the stack is: [root@ceph153 ~]# cat /proc/2510/task/*/stack [<ffffffffa06b2455>] closure_sync+0x25/0x90 [bcache] [<ffffffffa06b6be8>] bch_journal+0x118/0x2b0 [bcache] [<ffffffffa06b6dc7>] bch_journal_meta+0x47/0x70 [bcache] [<ffffffffa06be8f7>] bch_prio_write+0x237/0x340 [bcache] [<ffffffffa06a8018>] bch_allocator_thread+0x3c8/0x3d0 [bcache] [<ffffffff810a631f>] kthread+0xcf/0xe0 [<ffffffff8164c318>] ret_from_fork+0x58/0x90 [<ffffffffffffffff>] 0xffffffffffffffff [root@ceph153 ~]# cat /proc/2038/task/*/stack [<ffffffffa06b1abd>] __bch_btree_map_nodes+0x12d/0x150 [bcache] [<ffffffffa06b1bd1>] bch_btree_insert+0xf1/0x170 [bcache] [<ffffffffa06b637f>] bch_journal_replay+0x13f/0x230 [bcache] [<ffffffffa06c75fe>] run_cache_set+0x79a/0x7c2 [bcache] [<ffffffffa06c0cf8>] register_bcache+0xd48/0x1310 [bcache] [<ffffffff812f702f>] kobj_attr_store+0xf/0x20 [<ffffffff8125b216>] sysfs_write_file+0xc6/0x140 [<ffffffff811dfbfd>] vfs_write+0xbd/0x1e0 [<ffffffff811e069f>] SyS_write+0x7f/0xe0 [<ffffffff8164c3c9>] system_call_fastpath+0x16/0x1 The stack shows the register thread and allocator thread were getting stuck when registering cache device. I reboot the machine several times, the issue always exsit in this machine. I debug the code, and found the call trace as bellow: register_bcache() ==>run_cache_set() ==>bch_journal_replay() ==>bch_btree_insert() ==>__bch_btree_map_nodes() ==>btree_insert_fn() ==>btree_split() //node need split ==>btree_check_reserve() In btree_check_reserve(), It will check if there is enough buckets of RESERVE_BTREE type, since allocator thread did not work yet, so no buckets of RESERVE_BTREE type allocated, so the register thread waits on c->btree_cache_wait, and goes to sleep. Then the allocator thread initialized, the call trace is bellow: bch_allocator_thread() ==>bch_prio_write() ==>bch_journal_meta() ==>bch_journal() ==>journal_wait_for_write() In journal_wait_for_write(), It will check if journal is full by journal_full(), but the long time random small IO writing causes the exhaustion of journal buckets(journal.blocks_free=0), In order to release the journal buckets, the allocator calls btree_flush_write() to flush keys to btree nodes, and waits on c->journal.wait until btree nodes writing over or there has already some journal buckets space, then the allocator thread goes to sleep. but in btree_flush_write(), since bch_journal_replay() is not finished, so no btree nodes have journal (condition "if (btree_current_write(b)->journal)" never satisfied), so we got no btree node to flush, no journal bucket released, and allocator sleep all the times. Through the above analysis, we can see that: 1) Register thread wait for allocator thread to allocate buckets of RESERVE_BTREE type; 2) Alloctor thread wait for register thread to replay journal, so it can flush btree nodes and get journal bucket. then they are all got stuck by waiting for each other. Hua Rui provided a patch for me, by allocating some buckets of RESERVE_BTREE type in advance, so the register thread can get bucket when btree node splitting and no need to waiting for the allocator thread. I tested it, it has effect, and register thread run a step forward, but finally are still got stuck, the reason is only 8 bucket of RESERVE_BTREE type were allocated, and in bch_journal_replay(), after 2 btree nodes splitting, only 4 bucket of RESERVE_BTREE type left, then btree_check_reserve() is not satisfied anymore, so it goes to sleep again, and in the same time, alloctor thread did not flush enough btree nodes to release a journal bucket, so they all got stuck again. So we need to allocate more buckets of RESERVE_BTREE type in advance, but how much is enough? By experience and test, I think it should be as much as journal buckets. Then I modify the code as this patch, and test in the machine, and it works. This patch modified base on Hua Rui’s patch, and allocate more buckets of RESERVE_BTREE type in advance to avoid register thread and allocate thread going to wait for each other. [patch v2] ca->sb.njournal_buckets would be 0 in the first time after cache creation, and no journal exists, so just 8 btree buckets is OK. Signed-off-by: Hua Rui <huarui.dev@gmail.com> Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn> Reviewed-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-04-26bcache: properly set task state in bch_writeback_thread()Coly Li
[ Upstream commit 99361bbf26337186f02561109c17a4c4b1a7536a ] Kernel thread routine bch_writeback_thread() has the following code block, 447 down_write(&dc->writeback_lock); 448~450 if (check conditions) { 451 up_write(&dc->writeback_lock); 452 set_current_state(TASK_INTERRUPTIBLE); 453 454 if (kthread_should_stop()) 455 return 0; 456 457 schedule(); 458 continue; 459 } If condition check is true, its task state is set to TASK_INTERRUPTIBLE and call schedule() to wait for others to wake up it. There are 2 issues in current code, 1, Task state is set to TASK_INTERRUPTIBLE after the condition checks, if another process changes the condition and call wake_up_process(dc-> writeback_thread), then at line 452 task state is set back to TASK_INTERRUPTIBLE, the writeback kernel thread will lose a chance to be waken up. 2, At line 454 if kthread_should_stop() is true, writeback kernel thread will return to kernel/kthread.c:kthread() with TASK_INTERRUPTIBLE and call do_exit(). It is not good to enter do_exit() with task state TASK_INTERRUPTIBLE, in following code path might_sleep() is called and a warning message is reported by __might_sleep(): "WARNING: do not call blocking ops when !TASK_RUNNING; state=1 set at [xxxx]". For the first issue, task state should be set before condition checks. Ineed because dc->writeback_lock is required when modifying all the conditions, calling set_current_state() inside code block where dc-> writeback_lock is hold is safe. But this is quite implicit, so I still move set_current_state() before all the condition checks. For the second issue, frankley speaking it does not hurt when kernel thread exits with TASK_INTERRUPTIBLE state, but this warning message scares users, makes them feel there might be something risky with bcache and hurt their data. Setting task state to TASK_RUNNING before returning fixes this problem. In alloc.c:allocator_wait(), there is also a similar issue, and is also fixed in this patch. Changelog: v3: merge two similar fixes into one patch v2: fix the race issue in v1 patch. v1: initial buggy fix. Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Michael Lyle <mlyle@lyle.org> Cc: Michael Lyle <mlyle@lyle.org> Cc: Junhui Tang <tang.junhui@zte.com.cn> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-04-26dm mpath: return DM_MAPIO_REQUEUE on blk-mq rq allocation failureMing Lei
[ Upstream commit 050af08ffb1b62af69196d61c22a0755f9a3cdbd ] blk-mq will rerun queue via RESTART or dispatch wake after one request is completed, so not necessary to wait random time for requeuing, we should trust blk-mq to do it. More importantly, we need to return BLK_STS_RESOURCE to blk-mq so that dequeuing from the I/O scheduler can be stopped, this results in improved I/O merging. Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-04-24dm crypt: limit the number of allocated pagesMikulas Patocka
commit 5059353df86e2573ccd9d43fd9d9396dcec47ca2 upstream. dm-crypt consumes an excessive amount memory when the user attempts to zero a dm-crypt device with "blkdiscard -z". The command "blkdiscard -z" calls the BLKZEROOUT ioctl, it goes to the function __blkdev_issue_zeroout, __blkdev_issue_zeroout sends a large amount of write bios that contain the zero page as their payload. For each incoming page, dm-crypt allocates another page that holds the encrypted data, so when processing "blkdiscard -z", dm-crypt tries to allocate the amount of memory that is equal to the size of the device. This can trigger OOM killer or cause system crash. Fix this by limiting the amount of memory that dm-crypt allocates to 2% of total system memory. This limit is system-wide and is divided by the number of active dm-crypt devices and each device receives an equal share. Cc: stable@vger.kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-04-12bcache: segregate flash only volume write streamsTang Junhui
[ Upstream commit 4eca1cb28d8b0574ca4f1f48e9331c5f852d43b9 ] In such scenario that there are some flash only volumes , and some cached devices, when many tasks request these devices in writeback mode, the write IOs may fall to the same bucket as bellow: | cached data | flash data | cached data | cached data| flash data| then after writeback of these cached devices, the bucket would be like bellow bucket: | free | flash data | free | free | flash data | So, there are many free space in this bucket, but since data of flash only volumes still exists, so this bucket cannot be reclaimable, which would cause waste of bucket space. In this patch, we segregate flash only volume write streams from cached devices, so data from flash only volumes and cached devices can store in different buckets. Compare to v1 patch, this patch do not add a additionally open bucket list, and it is try best to segregate flash only volume write streams from cached devices, sectors of flash only volumes may still be mixed with dirty sectors of cached device, but the number is very small. [mlyle: fixed commit log formatting, permissions, line endings] Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn> Reviewed-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-04-12bcache: stop writeback thread after detachingTang Junhui
[ Upstream commit 8d29c4426b9f8afaccf28de414fde8a722b35fdf ] Currently, when a cached device detaching from cache, writeback thread is not stopped, and writeback_rate_update work is not canceled. For example, after the following command: echo 1 >/sys/block/sdb/bcache/detach you can still see the writeback thread. Then you attach the device to the cache again, bcache will create another writeback thread, for example, after below command: echo ba0fb5cd-658a-4533-9806-6ce166d883b9 > /sys/block/sdb/bcache/attach then you will see 2 writeback threads. This patch stops writeback thread and cancels writeback_rate_update work when cached device detaching from cache. Compare with patch v1, this v2 patch moves code down into the register lock for safety in case of any future changes as Coly and Mike suggested. [edit by mlyle: commit log spelling/formatting] Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn> Reviewed-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-04-12bcache: ret IOERR when read meets metadata errorRui Hua
[ Upstream commit b221fc130c49c50f4c2250d22e873420765a9fa2 ] The read request might meet error when searching the btree, but the error was not handled in cache_lookup(), and this kind of metadata failure will not go into cached_dev_read_error(), finally, the upper layer will receive bi_status=0. In this patch we judge the metadata error by the return value of bch_btree_map_keys(), there are two potential paths give rise to the error: 1. Because the btree is not totally cached in memery, we maybe get error when read btree node from cache device (see bch_btree_node_get()), the likely errno is -EIO, -ENOMEM 2. When read miss happens, bch_btree_insert_check_key() will be called to insert a "replace_key" to btree(see cached_dev_cache_miss(), just for doing preparatory work before insert the missed data to cache device), a failure can also happen in this situation, the likely errno is -ENOMEM bch_btree_map_keys() will return MAP_DONE in normal scenario, but we will get either -EIO or -ENOMEM in above two cases. if this happened, we should NOT recover data from backing device (when cache device is dirty) because we don't know whether bkeys the read request covered are all clean. And after that happened, s->iop.status is still its initially value(0) before we submit s->bio.bio, we set it to BLK_STS_IOERR, so it can go into cached_dev_read_error(), and finally it can be passed to upper layer, or recovered by reread from backing device. [edit by mlyle: patch formatting, word-wrap, comment spelling, commit log format] Signed-off-by: Hua Rui <huarui.dev@gmail.com> Reviewed-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-19dm raid: fix raid set size revalidationHeinz Mauelshagen
[ Upstream commit 61e06e2c3ebd986050958513bfa40dceed756f8f ] The raid set size is being revalidated unconditionally before a reshaping conversion is started. MD requires the size to only be reduced in case of a stripe removing (i.e. shrinking) reshape but not when growing because the raid array has to stay small until after the growing reshape finishes. Fix by avoiding the size revalidation in preresume unless a shrinking reshape is requested. Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-19dm mpath: fix passing integrity dataSteffen Maier
commit 8c5c147339d2e201108169327b1f99aa6d57d2cd upstream. After v4.12 commit e2460f2a4bc7 ("dm: mark targets that pass integrity data"), dm-multipath, e.g. on DIF+DIX SCSI disk paths, does not support block integrity any more. So add it to the whitelist. This is also a pre-requisite to use block integrity with other dm layer(s) on top of multipath, such as kpartx partitions (dm-linear) or LVM. Also, bump target version to reflect this fix. Fixes: e2460f2a4bc7 ("dm: mark targets that pass integrity data") Cc: <stable@vger.kernel.org> #4.12+ Bisected-by: Fedor Loshakov <loshakov@linux.vnet.ibm.com> Signed-off-by: Steffen Maier <maier@linux.vnet.ibm.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-15bcache: don't attach backing with duplicate UUIDMichael Lyle
commit 86755b7a96faed57f910f9e6b8061e019ac1ec08 upstream. This can happen e.g. during disk cloning. This is an incomplete fix: it does not catch duplicate UUIDs earlier when things are still unattached. It does not unregister the device. Further changes to cope better with this are planned but conflict with Coly's ongoing improvements to handling device errors. In the meantime, one can manually stop the device after this has happened. Attempts to attach a duplicate device result in: [ 136.372404] loop: module loaded [ 136.424461] bcache: register_bdev() registered backing device loop0 [ 136.424464] bcache: bch_cached_dev_attach() Tried to attach loop0 but duplicate UUID already attached My test procedure is: dd if=/dev/sdb1 of=imgfile bs=1024 count=262144 losetup -f imgfile Signed-off-by: Michael Lyle <mlyle@lyle.org> Reviewed-by: Tang Junhui <tang.junhui@zte.com.cn> Cc: <stable@vger.kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-15bcache: fix crashes in duplicate cache device registerTang Junhui
commit cc40daf91bdddbba72a4a8cd0860640e06668309 upstream. Kernel crashed when register a duplicate cache device, the call trace is bellow: [ 417.643790] CPU: 1 PID: 16886 Comm: bcache-register Tainted: G W OE 4.15.5-amd64-preempt-sysrq-20171018 #2 [ 417.643861] Hardware name: LENOVO 20ERCTO1WW/20ERCTO1WW, BIOS N1DET41W (1.15 ) 12/31/2015 [ 417.643870] RIP: 0010:bdevname+0x13/0x1e [ 417.643876] RSP: 0018:ffffa3aa9138fd38 EFLAGS: 00010282 [ 417.643884] RAX: 0000000000000000 RBX: ffff8c8f2f2f8000 RCX: ffffd6701f8 c7edf [ 417.643890] RDX: ffffa3aa9138fd88 RSI: ffffa3aa9138fd88 RDI: 00000000000 00000 [ 417.643895] RBP: ffffa3aa9138fde0 R08: ffffa3aa9138fae8 R09: 00000000000 1850e [ 417.643901] R10: ffff8c8eed34b271 R11: ffff8c8eed34b250 R12: 00000000000 00000 [ 417.643906] R13: ffffd6701f78f940 R14: ffff8c8f38f80000 R15: ffff8c8ea7d 90000 [ 417.643913] FS: 00007fde7e66f500(0000) GS:ffff8c8f61440000(0000) knlGS: 0000000000000000 [ 417.643919] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 417.643925] CR2: 0000000000000314 CR3: 00000007e6fa0001 CR4: 00000000003 606e0 [ 417.643931] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 00000000000 00000 [ 417.643938] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 00000000000 00400 [ 417.643946] Call Trace: [ 417.643978] register_bcache+0x1117/0x1270 [bcache] [ 417.643994] ? slab_pre_alloc_hook+0x15/0x3c [ 417.644001] ? slab_post_alloc_hook.isra.44+0xa/0x1a [ 417.644013] ? kernfs_fop_write+0xf6/0x138 [ 417.644020] kernfs_fop_write+0xf6/0x138 [ 417.644031] __vfs_write+0x31/0xcc [ 417.644043] ? current_kernel_time64+0x10/0x36 [ 417.644115] ? __audit_syscall_entry+0xbf/0xe3 [ 417.644124] vfs_write+0xa5/0xe2 [ 417.644133] SyS_write+0x5c/0x9f [ 417.644144] do_syscall_64+0x72/0x81 [ 417.644161] entry_SYSCALL_64_after_hwframe+0x3d/0xa2 [ 417.644169] RIP: 0033:0x7fde7e1c1974 [ 417.644175] RSP: 002b:00007fff13009a38 EFLAGS: 00000246 ORIG_RAX: 0000000 000000001 [ 417.644183] RAX: ffffffffffffffda RBX: 0000000001658280 RCX: 00007fde7e1c 1974 [ 417.644188] RDX: 000000000000000a RSI: 0000000001658280 RDI: 000000000000 0001 [ 417.644193] RBP: 000000000000000a R08: 0000000000000003 R09: 000000000000 0077 [ 417.644198] R10: 000000000000089e R11: 0000000000000246 R12: 000000000000 0001 [ 417.644203] R13: 000000000000000a R14: 7fffffffffffffff R15: 000000000000 0000 [ 417.644213] Code: c7 c2 83 6f ee 98 be 20 00 00 00 48 89 df e8 6c 27 3b 0 0 48 89 d8 5b c3 0f 1f 44 00 00 48 8b 47 70 48 89 f2 48 8b bf 80 00 00 00 <8 b> b0 14 03 00 00 e9 73 ff ff ff 0f 1f 44 00 00 48 8b 47 40 39 [ 417.644302] RIP: bdevname+0x13/0x1e RSP: ffffa3aa9138fd38 [ 417.644306] CR2: 0000000000000314 When registering duplicate cache device in register_cache(), after failure on calling register_cache_set(), bch_cache_release() will be called, then bdev will be freed, so bdevname(bdev, name) caused kernel crash. Since bch_cache_release() will free bdev, so in this patch we make sure bdev being freed if register_cache() fail, and do not free bdev again in register_bcache() when register_cache() fail. Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn> Reported-by: Marc MERLIN <marc@merlins.org> Tested-by: Michael Lyle <mlyle@lyle.org> Reviewed-by: Michael Lyle <mlyle@lyle.org> Cc: <stable@vger.kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-15dm bufio: avoid false-positive Wmaybe-uninitialized warningArnd Bergmann
commit 590347e4000356f55eb10b03ced2686bd74dab40 upstream. gcc-6.3 and earlier show a new warning after a seemingly unrelated change to the arm64 PAGE_KERNEL definition: In file included from drivers/md/dm-bufio.c:14:0: drivers/md/dm-bufio.c: In function 'alloc_buffer': include/linux/sched/mm.h:182:56: warning: 'noio_flag' may be used uninitialized in this function [-Wmaybe-uninitialized] current->flags = (current->flags & ~PF_MEMALLOC_NOIO) | flags; ^ The same warning happened earlier on linux-3.18 for MIPS and I did a workaround for that, but now it's come back. gcc-7 and newer are apparently smart enough to figure this out, and other architectures don't show it, so the best I could come up with is to rework the caller slightly in a way that makes it obvious enough to all arm64 compilers what is happening here. Fixes: 41acec624087 ("arm64: kpti: Make use of nG dependent on arm64_kernel_unmapped_at_el0()") Link: https://patchwork.kernel.org/patch/9692829/ Cc: stable@vger.kernel.org Signed-off-by: Arnd Bergmann <arnd@arndb.de> [snitzer: moved declarations inside conditional, altered vmalloc return] Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-08md: only allow remove_and_add_spares when no sync_thread running.NeilBrown
commit 39772f0a7be3b3dc26c74ea13fe7847fd1522c8b upstream. The locking protocols in md assume that a device will never be removed from an array during resync/recovery/reshape. When that isn't happening, rcu or reconfig_mutex is needed to protect an rdev pointer while taking a refcount. When it is happening, that protection isn't needed. Unfortunately there are cases were remove_and_add_spares() is called when recovery might be happening: is state_store(), slot_store() and hot_remove_disk(). In each case, this is just an optimization, to try to expedite removal from the personality so the device can be removed from the array. If resync etc is happening, we just have to wait for md_check_recover to find a suitable time to call remove_and_add_spares(). This optimization and not essential so it doesn't matter if it fails. So change remove_and_add_spares() to abort early if resync/recovery/reshape is happening, unless it is called from md_check_recovery() as part of a newly started recovery. The parameter "this" is only NULL when called from md_check_recovery() so when it is NULL, there is no need to abort. As this can result in a NULL dereference, the fix is suitable for -stable. cc: yuyufen <yuyufen@huawei.com> Cc: Tomasz Majchrzak <tomasz.majchrzak@intel.com> Fixes: 8430e7e0af9a ("md: disconnect device from personality before trying to remove it.") Cc: stable@ver.kernel.org (v4.8+) Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <sh.li@alibaba-inc.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-25md/raid1/10: add missed blk plugShaohua Li
[ Upstream commit 18022a1bd3709b74ca31ef0b28fccd52bcd6c504 ] flush_pending_writes isn't always called with block plug, so add it, and plug works in nested way. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-25md/raid5: correct degraded calculation in raid5_errorbingjingc
[ Upstream commit aff69d89bdebc39235cddb4445371eb979b49685 ] When disk failure occurs on new disks for reshape, mddev->degraded is not calculated correctly. Faulty bit of the failure device is not set before raid5_calc_degraded(conf). mdadm --create /dev/md0 --level=5 --raid-devices=3 /dev/loop[012] mdadm /dev/md0 -a /dev/loop3 mdadm /dev/md0 --grow -n4 mdadm /dev/md0 -f /dev/loop3 # simulating disk failure cat /sys/block/md0/md/degraded # it outputs 0, but it should be 1. However, mdadm -D /dev/md0 will show that it is degraded. It's a bug. It can be fixed by moving the resources raid5_calc_degraded() depends on before it. Reported-by: Roy Chung <roychung@synology.com> Reviewed-by: Alex Wu <alexwu@synology.com> Signed-off-by: BingJing Chang <bingjingc@synology.com> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-22dm: correctly handle chained bios in dec_pending()NeilBrown
commit 8dd601fa8317243be887458c49f6c29c2f3d719f upstream. dec_pending() is given an error status (possibly 0) to be recorded against a bio. It can be called several times on the one 'struct dm_io', and it is careful to only assign a non-zero error to io->status. However when it then assigned io->status to bio->bi_status, it is not careful and could overwrite a genuine error status with 0. This can happen when chained bios are in use. If a bio is chained beneath the bio that this dm_io is handling, the child bio might complete and set bio->bi_status before the dm_io completes. This has been possible since chained bios were introduced in 3.14, and has become a lot easier to trigger with commit 18a25da84354 ("dm: ensure bio submission follows a depth-first tree walk") as that commit caused dm to start using chained bios itself. A particular failure mode is that if a bio spans an 'error' target and a working target, the 'error' fragment will complete instantly and set the ->bi_status, and the other fragment will normally complete a little later, and will clear ->bi_status. The fix is simply to only assign io_error to bio->bi_status when io_error is not zero. Reported-and-tested-by: Milan Broz <gmazyland@gmail.com> Cc: stable@vger.kernel.org (v3.14+) Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-03bcache: check return value of register_shrinkerMichael Lyle
[ Upstream commit 6c4ca1e36cdc1a0a7a84797804b87920ccbebf51 ] register_shrinker is now __must_check, so check it to kill a warning. Caller of bch_btree_cache_alloc in super.c appropriately checks return value so this is fully plumbed through. This V2 fixes checkpatch warnings and improves the commit description, as I was too hasty getting the previous version out. Signed-off-by: Michael Lyle <mlyle@lyle.org> Reviewed-by: Vojtech Pavlik <vojtech@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-23dm crypt: fix error return code in crypt_ctr()Wei Yongjun
commit 3cc2e57c4beabcbbaa46e1ac6d77ca8276a4a42d upstream. Fix to return error code -ENOMEM from the mempool_create_kmalloc_pool() error handling case instead of 0, as done elsewhere in this function. Fixes: ef43aa38063a6 ("dm crypt: add cryptographic data integrity protection (authenticated encryption)") Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-23dm crypt: wipe kernel key copy after IV initializationOndrej Kozina
commit dc94902bde1e158cd19c4deab208e5d6eb382a44 upstream. Loading key via kernel keyring service erases the internal key copy immediately after we pass it in crypto layer. This is wrong because IV is initialized later and we use wrong key for the initialization (instead of real key there's just zeroed block). The bug may cause data corruption if key is loaded via kernel keyring service first and later same crypt device is reactivated using exactly same key in hexbyte representation, or vice versa. The bug (and fix) affects only ciphers using following IVs: essiv, lmk and tcw. Fixes: c538f6ec9f56 ("dm crypt: add ability to use keys from the kernel key retention service") Signed-off-by: Ondrej Kozina <okozina@redhat.com> Reviewed-by: Milan Broz <gmazyland@gmail.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-23dm crypt: fix crash by adding missing check for auth key sizeMilan Broz
commit 27c7003697fc2c78f965984aa224ef26cd6b2949 upstream. If dm-crypt uses authenticated mode with separate MAC, there are two concatenated part of the key structure - key(s) for encryption and authentication key. Add a missing check for authenticated key length. If this key length is smaller than actually provided key, dm-crypt now properly fails instead of crashing. Fixes: ef43aa3806 ("dm crypt: add cryptographic data integrity protection (authenticated encryption)") Reported-by: Salah Coronya <salahx@yahoo.com> Signed-off-by: Milan Broz <gmazyland@gmail.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-23dm integrity: don't store cipher request on the stackMikulas Patocka
commit 717f4b1c52135f279112df82583e0c77e80f90de upstream. Some asynchronous cipher implementations may use DMA. The stack may be mapped in the vmalloc area that doesn't support DMA. Therefore, the cipher request and initialization vector shouldn't be on the stack. Fix this by allocating the request and iv with kmalloc. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-23dm thin metadata: THIN_MAX_CONCURRENT_LOCKS should be 6Dennis Yang
commit 490ae017f54e55bde382d45ea24bddfb6d1a0aaf upstream. For btree removal, there is a corner case that a single thread could takes 6 locks which is more than THIN_MAX_CONCURRENT_LOCKS(5) and leads to deadlock. A btree removal might eventually call rebalance_children()->rebalance3() to rebalance entries of three neighbor child nodes when shadow_spine has already acquired two write locks. In rebalance3(), it tries to shadow and acquire the write locks of all three child nodes. However, shadowing a child node requires acquiring a read lock of the original child node and a write lock of the new block. Although the read lock will be released after block shadowing, shadowing the third child node in rebalance3() could still take the sixth lock. (2 write locks for shadow_spine + 2 write locks for the first two child nodes's shadow + 1 write lock for the last child node's shadow + 1 read lock for the last child node) Signed-off-by: Dennis Yang <dennisyang@qnap.com> Acked-by: Joe Thornber <thornber@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-23dm btree: fix serious bug in btree_split_beneath()Joe Thornber
commit bc68d0a43560e950850fc69b58f0f8254b28f6d6 upstream. When inserting a new key/value pair into a btree we walk down the spine of btree nodes performing the following 2 operations: i) space for a new entry ii) adjusting the first key entry if the new key is lower than any in the node. If the _root_ node is full, the function btree_split_beneath() allocates 2 new nodes, and redistibutes the root nodes entries between them. The root node is left with 2 entries corresponding to the 2 new nodes. btree_split_beneath() then adjusts the spine to point to one of the two new children. This means the first key is never adjusted if the new key was lower, ie. operation (ii) gets missed out. This can result in the new key being 'lost' for a period; until another low valued key is inserted that will uncover it. This is a serious bug, and quite hard to make trigger in normal use. A reproducing test case ("thin create devices-in-reverse-order") is available as part of the thin-provision-tools project: https://github.com/jthornber/thin-provisioning-tools/blob/master/functional-tests/device-mapper/dm-tests.scm#L593 Fix the issue by changing btree_split_beneath() so it no longer adjusts the spine. Instead it unlocks both the new nodes, and lets the main loop in btree_insert_raw() relock the appropriate one and make any neccessary adjustments. Reported-by: Monty Pavel <monty_pavel@sina.com> Signed-off-by: Joe Thornber <thornber@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-17dm bufio: fix shrinker scans when (nr_to_scan < retain_target)Suren Baghdasaryan
commit fbc7c07ec23c040179384a1f16b62b6030eb6bdd upstream. When system is under memory pressure it is observed that dm bufio shrinker often reclaims only one buffer per scan. This change fixes the following two issues in dm bufio shrinker that cause this behavior: 1. ((nr_to_scan - freed) <= retain_target) condition is used to terminate slab scan process. This assumes that nr_to_scan is equal to the LRU size, which might not be correct because do_shrink_slab() in vmscan.c calculates nr_to_scan using multiple inputs. As a result when nr_to_scan is less than retain_target (64) the scan will terminate after the first iteration, effectively reclaiming one buffer per scan and making scans very inefficient. This hurts vmscan performance especially because mutex is acquired/released every time dm_bufio_shrink_scan() is called. New implementation uses ((LRU size - freed) <= retain_target) condition for scan termination. LRU size can be safely determined inside __scan() because this function is called after dm_bufio_lock(). 2. do_shrink_slab() uses value returned by dm_bufio_shrink_count() to determine number of freeable objects in the slab. However dm_bufio always retains retain_target buffers in its LRU and will terminate a scan when this mark is reached. Therefore returning the entire LRU size from dm_bufio_shrink_count() is misleading because that does not represent the number of freeable objects that slab will reclaim during a scan. Returning (LRU size - retain_target) better represents the number of freeable objects in the slab. This way do_shrink_slab() returns 0 when (LRU size < retain_target) and vmscan will not try to scan this shrinker avoiding scans that will not reclaim any memory. Test: tested using Android device running <AOSP>/system/extras/alloc-stress that generates memory pressure and causes intensive shrinker scans Signed-off-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-25md: always set THREAD_WAKEUP and wake up wqueue if thread existedGuoqing Jiang
[ Upstream commit d1d90147c9680aaec4a5757932c2103c42c9c23b ] Since commit 4ad23a976413 ("MD: use per-cpu counter for writes_pending"), the wait_queue is only got invoked if THREAD_WAKEUP is not set previously. With above change, I can see process_metadata_update could always hang on the wait queue, because mddev->thread could stay on 'D' status and the THREAD_WAKEUP flag is not cleared since there are lots of place to wake up mddev->thread. Then deadlock happened as follows: linux175:~ # ps aux|grep md|grep D root 20117 0.0 0.0 0 0 ? D 03:45 0:00 [md0_raid1] root 20125 0.0 0.0 0 0 ? D 03:45 0:00 [md0_cluster_rec] linux175:~ # cat /proc/20117/stack [<ffffffffa0635604>] dlm_lock_sync+0x94/0xd0 [md_cluster] [<ffffffffa0635674>] lock_token+0x34/0xd0 [md_cluster] [<ffffffffa0635804>] metadata_update_start+0x64/0x110 [md_cluster] [<ffffffffa04d985b>] md_update_sb.part.58+0x9b/0x860 [md_mod] [<ffffffffa04da035>] md_update_sb+0x15/0x30 [md_mod] [<ffffffffa04dc066>] md_check_recovery+0x266/0x490 [md_mod] [<ffffffffa06450e2>] raid1d+0x42/0x810 [raid1] [<ffffffffa04d2252>] md_thread+0x122/0x150 [md_mod] [<ffffffff81091741>] kthread+0x101/0x140 linux175:~ # cat /proc/20125/stack [<ffffffffa0636679>] recv_daemon+0x3f9/0x5c0 [md_cluster] [<ffffffffa04d2252>] md_thread+0x122/0x150 [md_mod] [<ffffffff81091741>] kthread+0x101/0x140 So let's revert the part of code in the commit to resovle the problem since we can't get lots of benefits of previous change. Fixes: 4ad23a976413 ("MD: use per-cpu counter for writes_pending") Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-25locking/barriers: Convert users of lockless_dereference() to READ_ONCE()Will Deacon
commit 3382290ed2d5e275429cef510ab21889d3ccd164 upstream. [ Note, this is a Git cherry-pick of the following commit: 506458efaf15 ("locking/barriers: Convert users of lockless_dereference() to READ_ONCE()") ... for easier x86 PTI code testing and back-porting. ] READ_ONCE() now has an implicit smp_read_barrier_depends() call, so it can be used instead of lockless_dereference() without any change in semantics. Signed-off-by: Will Deacon <will.deacon@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1508840570-22169-4-git-send-email-will.deacon@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-20md-cluster: fix wrong condition check in raid1_write_requestGuoqing Jiang
[ Upstream commit 385f4d7f946b08f36f68b0a28e95a319925b6b62 ] The check used here is to avoid conflict between write and resync, however we used the wrong logic, it should be the inverse of the checking inside "if". Fixes: 589a1c4 ("Suspend writes in RAID1 if within range") Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-20raid5-ppl: check recovery_offset when performing ppl recoveryArtur Paszkiewicz
[ Upstream commit 07719ff767dcd8cc42050f185d332052f3816546 ] If starting an array that is undergoing rebuild, make ppl recovery honor the recovery_offset of a member disk and don't read data that is not yet in-sync. Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-20raid5: Set R5_Expanded on parity devices as well as data.NeilBrown
[ Upstream commit 235b6003fb28f0dd8e7ed8fbdb088bb548291766 ] When reshaping a fully degraded raid5/raid6 to a larger nubmer of devices, the new device(s) are not in-sync and so that can make the newly grown stripe appear to be "failed". To avoid this, we set the R5_Expanded flag to say "Even though this device is not fully in-sync, this block is safe so don't treat the device as failed for this stripe". This flag is set for data devices, not not for parity devices. Consequently, if you have a RAID6 with two devices that are partly recovered and a spare, and start a reshape to include the spare, then when the reshape gets past the point where the recovery was up to, it will think the stripes are failed and will get into an infinite loop, failing to make progress. So when contructing parity on an EXPAND_READY stripe, set R5_Expanded. Reported-by: Curt <lightspd@gmail.com> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-20bcache: fix wrong cache_misses statisticstang.junhui
[ Upstream commit c157313791a999646901b3e3c6888514ebc36d62 ] Currently, Cache missed IOs are identified by s->cache_miss, but actually, there are many situations that missed IOs are not assigned a value for s->cache_miss in cached_dev_cache_miss(), for example, a bypassed IO (s->iop.bypass = 1), or the cache_bio allocate failed. In these situations, it will go to out_put or out_submit, and s->cache_miss is null, which leads bch_mark_cache_accounting() to treat this IO as a hit IO. [ML: applied by 3-way merge] Signed-off-by: tang.junhui <tang.junhui@zte.com.cn> Reviewed-by: Michael Lyle <mlyle@lyle.org> Reviewed-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-20bcache: explicitly destroy mutex while exitingLiang Chen
[ Upstream commit 330a4db89d39a6b43f36da16824eaa7a7509d34d ] mutex_destroy does nothing most of time, but it's better to call it to make the code future proof and it also has some meaning for like mutex debug. As Coly pointed out in a previous review, bcache_exit() may not be able to handle all the references properly if userspace registers cache and backing devices right before bch_debug_init runs and bch_debug_init failes later. So not exposing userspace interface until everything is ready to avoid that issue. Signed-off-by: Liang Chen <liangchen.linux@gmail.com> Reviewed-by: Michael Lyle <mlyle@lyle.org> Reviewed-by: Coly Li <colyli@suse.de> Reviewed-by: Eric Wheeler <bcache@linux.ewheeler.net> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-20dm: fix various targets to dm_register_target after module __init resources ↵monty_pavel@sina.com
created commit 7e6358d244e4706fe612a77b9c36519a33600ac0 upstream. A NULL pointer is seen if two concurrent "vgchange -ay -K <vg name>" processes race to load the dm-thin-pool module: PID: 25992 TASK: ffff883cd7d23500 CPU: 4 COMMAND: "vgchange" #0 [ffff883cd743d600] machine_kexec at ffffffff81038fa9 0000001 [ffff883cd743d660] crash_kexec at ffffffff810c5992 0000002 [ffff883cd743d730] oops_end at ffffffff81515c90 0000003 [ffff883cd743d760] no_context at ffffffff81049f1b 0000004 [ffff883cd743d7b0] __bad_area_nosemaphore at ffffffff8104a1a5 0000005 [ffff883cd743d800] bad_area at ffffffff8104a2ce 0000006 [ffff883cd743d830] __do_page_fault at ffffffff8104aa6f 0000007 [ffff883cd743d950] do_page_fault at ffffffff81517bae 0000008 [ffff883cd743d980] page_fault at ffffffff81514f95 [exception RIP: kmem_cache_alloc+108] RIP: ffffffff8116ef3c RSP: ffff883cd743da38 RFLAGS: 00010046 RAX: 0000000000000004 RBX: ffffffff81121b90 RCX: ffff881bf1e78cc0 RDX: 0000000000000000 RSI: 00000000000000d0 RDI: 0000000000000000 RBP: ffff883cd743da68 R8: ffff881bf1a4eb00 R9: 0000000080042000 R10: 0000000000002000 R11: 0000000000000000 R12: 00000000000000d0 R13: 0000000000000000 R14: 00000000000000d0 R15: 0000000000000246 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 0000009 [ffff883cd743da70] mempool_alloc_slab at ffffffff81121ba5 0000010 [ffff883cd743da80] mempool_create_node at ffffffff81122083 0000011 [ffff883cd743dad0] mempool_create at ffffffff811220f4 0000012 [ffff883cd743dae0] pool_ctr at ffffffffa08de049 [dm_thin_pool] 0000013 [ffff883cd743dbd0] dm_table_add_target at ffffffffa0005f2f [dm_mod] 0000014 [ffff883cd743dc30] table_load at ffffffffa0008ba9 [dm_mod] 0000015 [ffff883cd743dc90] ctl_ioctl at ffffffffa0009dc4 [dm_mod] The race results in a NULL pointer because: Process A (vgchange -ay -K): a. send DM_LIST_VERSIONS_CMD ioctl; b. pool_target not registered; c. modprobe dm_thin_pool and wait until end. Process B (vgchange -ay -K): a. send DM_LIST_VERSIONS_CMD ioctl; b. pool_target registered; c. table_load->dm_table_add_target->pool_ctr; d. _new_mapping_cache is NULL and panic. Note: 1. process A and process B are two concurrent processes. 2. pool_target can be detected by process B but _new_mapping_cache initialization has not ended. To fix dm-thin-pool, and other targets (cache, multipath, and snapshot) with the same problem, simply dm_register_target() after all resources created during module init (as labelled with __init) are finished. Signed-off-by: monty <monty_pavel@sina.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-17md: free unused memory after bitmap resizeZdenek Kabelac
[ Upstream commit 0868b99c214a3d55486c700de7c3f770b7243e7c ] When bitmap is resized, the old kalloced chunks just are not released once the resized bitmap starts to use new space. This fixes in particular kmemleak reports like this one: unreferenced object 0xffff8f4311e9c000 (size 4096): comm "lvm", pid 19333, jiffies 4295263268 (age 528.265s) hex dump (first 32 bytes): 02 80 02 80 02 80 02 80 02 80 02 80 02 80 02 80 ................ 02 80 02 80 02 80 02 80 02 80 02 80 02 80 02 80 ................ backtrace: [<ffffffffa69471ca>] kmemleak_alloc+0x4a/0xa0 [<ffffffffa628c10e>] kmem_cache_alloc_trace+0x14e/0x2e0 [<ffffffffa676cfec>] bitmap_checkpage+0x7c/0x110 [<ffffffffa676d0c5>] bitmap_get_counter+0x45/0xd0 [<ffffffffa676d6b3>] bitmap_set_memory_bits+0x43/0xe0 [<ffffffffa676e41c>] bitmap_init_from_disk+0x23c/0x530 [<ffffffffa676f1ae>] bitmap_load+0xbe/0x160 [<ffffffffc04c47d3>] raid_preresume+0x203/0x2f0 [dm_raid] [<ffffffffa677762f>] dm_table_resume_targets+0x4f/0xe0 [<ffffffffa6774b52>] dm_resume+0x122/0x140 [<ffffffffa6779b9f>] dev_suspend+0x18f/0x290 [<ffffffffa677a3a7>] ctl_ioctl+0x287/0x560 [<ffffffffa677a693>] dm_ctl_ioctl+0x13/0x20 [<ffffffffa62d6b46>] do_vfs_ioctl+0xa6/0x750 [<ffffffffa62d7269>] SyS_ioctl+0x79/0x90 [<ffffffffa6956d41>] entry_SYSCALL_64_fastpath+0x1f/0xc2 Signed-off-by: Zdenek Kabelac <zkabelac@redhat.com> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-17dm raid: fix panic when attempting to force a raid to syncHeinz Mauelshagen
[ Upstream commit 233978449074ca7e45d9c959f9ec612d1b852893 ] Requesting a sync on an active raid device via a table reload (see 'sync' parameter in Documentation/device-mapper/dm-raid.txt) skips the super_load() call that defines the superblock size (rdev->sb_size) -- resulting in an oops if/when super_sync()->memset() is called. Fix by moving the initialization of the superblock start and size out of super_load() to the caller (analyse_superblocks). Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-14md/r5cache: move mddev_lock() out of r5c_journal_mode_set()Song Liu
commit ff35f58e8f8eb520367879a0ccc6f2ec4b62b17b upstream. r5c_journal_mode_set() is called by r5c_journal_mode_store() and raid_ctr() in dm-raid. We don't need mddev_lock() when calling from raid_ctr(). This patch fixes this by moves the mddev_lock() to r5c_journal_mode_store(). Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-05md: forbid a RAID5 from having both a bitmap and a journal.NeilBrown
commit 230b55fa8d64007339319539f8f8e68114d08529 upstream. Having both a bitmap and a journal is pointless. Attempting to do so can corrupt the bitmap if the journal replay happens before the bitmap is initialized. Rather than try to avoid this corruption, simply refuse to allow arrays with both a bitmap and a journal. So: - if raid5_run sees both are present, fail. - if adding a bitmap finds a journal is present, fail - if adding a journal finds a bitmap is present, fail. Signed-off-by: NeilBrown <neilb@suse.com> Tested-by: Joshua Kinard <kumba@gentoo.org> Acked-by: Joshua Kinard <kumba@gentoo.org> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-05bcache: recover data from backing when data is cleanRui Hua
commit e393aa2446150536929140739f09c6ecbcbea7f0 upstream. When we send a read request and hit the clean data in cache device, there is a situation called cache read race in bcache(see the commit in the tail of cache_look_up(), the following explaination just copy from there): The bucket we're reading from might be reused while our bio is in flight, and we could then end up reading the wrong data. We guard against this by checking (in bch_cache_read_endio()) if the pointer is stale again; if so, we treat it as an error (s->iop.error = -EINTR) and reread from the backing device (but we don't pass that error up anywhere) It should be noted that cache read race happened under normal circumstances, not the circumstance when SSD failed, it was counted and shown in /sys/fs/bcache/XXX/internal/cache_read_races. Without this patch, when we use writeback mode, we will never reread from the backing device when cache read race happened, until the whole cache device is clean, because the condition (s->recoverable && (dc && !atomic_read(&dc->has_dirty))) is false in cached_dev_read_error(). In this situation, the s->iop.error(= -EINTR) will be passed up, at last, user will receive -EINTR when it's bio end, this is not suitable, and wield to up-application. In this patch, we use s->read_dirty_data to judge whether the read request hit dirty data in cache device, it is safe to reread data from the backing device when the read request hit clean data. This can not only handle cache read race, but also recover data when failed read request from cache device. [edited by mlyle to fix up whitespace, commit log title, comment spelling] Fixes: d59b23795933 ("bcache: only permit to recovery read error when cache device is clean") Signed-off-by: Hua Rui <huarui.dev@gmail.com> Reviewed-by: Michael Lyle <mlyle@lyle.org> Reviewed-by: Coly Li <colyli@suse.de> Signed-off-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-05bcache: only permit to recovery read error when cache device is cleanColy Li
commit d59b23795933678c9638fd20c942d2b4f3cd6185 upstream. When bcache does read I/Os, for example in writeback or writethrough mode, if a read request on cache device is failed, bcache will try to recovery the request by reading from cached device. If the data on cached device is not synced with cache device, then requester will get a stale data. For critical storage system like database, providing stale data from recovery may result an application level data corruption, which is unacceptible. With this patch, for a failed read request in writeback or writethrough mode, recovery a recoverable read request only happens when cache device is clean. That is to say, all data on cached device is up to update. For other cache modes in bcache, read request will never hit cached_dev_read_error(), they don't need this patch. Please note, because cache mode can be switched arbitrarily in run time, a writethrough mode might be switched from a writeback mode. Therefore checking dc->has_data in writethrough mode still makes sense. Changelog: V4: Fix parens error pointed by Michael Lyle. v3: By response from Kent Oversteet, he thinks recovering stale data is a bug to fix, and option to permit it is unnecessary. So this version the sysfs file is removed. v2: rename sysfs entry from allow_stale_data_on_failure to allow_stale_data_on_failure, and fix the confusing commit log. v1: initial patch posted. [small change to patch comment spelling by mlyle] Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Michael Lyle <mlyle@lyle.org> Reported-by: Arne Wolf <awolf@lenovo.com> Reviewed-by: Michael Lyle <mlyle@lyle.org> Cc: Kent Overstreet <kent.overstreet@gmail.com> Cc: Nix <nix@esperi.org.uk> Cc: Kai Krakow <hurikhan77@gmail.com> Cc: Eric Wheeler <bcache@lists.ewheeler.net> Cc: Junhui Tang <tang.junhui@zte.com.cn> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-05bcache: Fix building error on MIPSHuacai Chen
commit cf33c1ee5254c6a430bc1538232b49c3ea13e613 upstream. This patch try to fix the building error on MIPS. The reason is MIPS has already defined the PTR macro, which conflicts with the PTR macro in include/uapi/linux/bcache.h. [fixed by mlyle: corrected a line-length issue] Signed-off-by: Huacai Chen <chenhc@lemote.com> Reviewed-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>