linux-yocto - Yocto Linux Embedded kernel

Age	Commit message (Collapse)	Author
2014-11-14	dm log userspace: fix memory leak in dm_ulog_tfr_init failure path	Alexey Khoroshilov
	commit 56ec16cb1e1ce46354de8511eef962a417c32c92 upstream. If cn_add_callback() fails in dm_ulog_tfr_init(), it does not deallocate prealloced memory but calls cn_del_callback(). Found by Linux Driver Verification project (linuxtesting.org). Signed-off-by: Alexey Khoroshilov <khoroshilov@ispras.ru> Reviewed-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14	dm bufio: when done scanning return from __scan immediately	Mikulas Patocka
	commit 0e825862f3c04cee40e25f55680333728a4ffa9b upstream. When __scan frees the required number of buffer entries that the shrinker requested (nr_to_scan becomes zero) it must return. Before this fix the __scan code exited only the inner loop and continued in the outer loop -- which could result in reduced performance due to extra buffers being freed (e.g. unnecessarily evicted thinp metadata needing to be synchronously re-read into bufio's cache). Also, move dm_bufio_cond_resched to __scan's inner loop, so that iterating the bufio client's lru lists doesn't result in scheduling latency. Reported-by: Joe Thornber <thornber@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14	dm bufio: update last_accessed when relinking a buffer	Joe Thornber
	commit eb76faf53b1ff7a77ce3f78cc98ad392ac70c2a0 upstream. The 'last_accessed' member of the dm_buffer structure was only set when the the buffer was created. This led to each buffer being discarded after dm_bufio_max_age time even if it was used recently. In practice this resulted in all thinp metadata being evicted soon after being read -- this is particularly problematic for metadata intensive workloads like multithreaded small random IO. 'last_accessed' is now updated each time the buffer is moved to the head of the LRU list, so the buffer is now properly discarded if it was not used in dm_bufio_max_age time. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-10-09	md/raid5: disable 'DISCARD' by default due to safety concerns.	NeilBrown
	commit 8e0e99ba64c7ba46133a7c8a3e3f7de01f23bd93 upstream. It has come to my attention (thanks Martin) that 'discard_zeroes_data' is only a hint. Some devices in some cases don't do what it says on the label. The use of DISCARD in RAID5 depends on reads from discarded regions being predictably zero. If a write to a previously discarded region performs a read-modify-write cycle it assumes that the parity block was consistent with the data blocks. If all were zero, this would be the case. If some are and some aren't this would not be the case. This could lead to data corruption after a device failure when data needs to be reconstructed from the parity. As we cannot trust 'discard_zeroes_data', ignore it by default and so disallow DISCARD on all raid4/5/6 arrays. As many devices are trustworthy, and as there are benefits to using DISCARD, add a module parameter to over-ride this caution and cause DISCARD to work if discard_zeroes_data is set. If a site want to enable DISCARD on some arrays but not on others they should select DISCARD support at the filesystem level, and set the raid456 module parameter. raid456.devices_handle_discard_safely=Y As this is a data-safety issue, I believe this patch is suitable for -stable. DISCARD support for RAID456 was added in 3.7 Cc: Shaohua Li <shli@kernel.org> Cc: "Martin K. Petersen" <martin.petersen@oracle.com> Cc: Mike Snitzer <snitzer@redhat.com> Cc: Heinz Mauelshagen <heinzm@redhat.com> Acked-by: Martin K. Petersen <martin.petersen@oracle.com> Acked-by: Mike Snitzer <snitzer@redhat.com> Fixes: 620125f2bf8ff0c4969b79653b54d7bcc9d40637 Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-10-05	md/raid1: intialise start_next_window for READ case to avoid hang	NeilBrown
	commit f0cc9a057151892b885be21a1d19b0185568281d upstream. r1_bio->start_next_window is not initialised in the READ case, so allow_barrier may incorrectly decrement conf->current_window_requests which can cause raise_barrier() to block forever. Fixes: 79ef3a8aa1cb1523cc231c9a90a278333c21f761 Reported-by: Brassow Jonathan <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-10-05	md/raid1: fix_read_error should act on all non-faulty devices.	NeilBrown
	commit b8cb6b4c121e1bf1963c16ed69e7adcb1bc301cd upstream. If a devices is being recovered it is not InSync and is not Faulty. If a read error is experienced on that device, fix_read_error() will be called, but it ignores non-InSync devices. So it will neither fix the error nor fail the device. It is incorrect that fix_read_error() ignores non-InSync devices. It should only ignore Faulty devices. So fix it. This became a bug when we allowed reading from a device that was being recovered. It is suitable for any subsequent -stable kernel. Fixes: da8840a747c0dbf49506ec906757a6b87b9741e9 Reported-by: Alexander Lyakas <alex.bolshoy@gmail.com> Tested-by: Alexander Lyakas <alex.bolshoy@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-10-05	md/raid1: count resync requests in nr_pending.	NeilBrown
	commit 34e97f170149bfa14979581c4c748bc9b4b79d5b upstream. Both normal IO and resync IO can be retried with reschedule_retry() and so be counted into ->nr_queued, but only normal IO gets counted in ->nr_pending. Before the recent improvement to RAID1 resync there could only possibly have been one or the other on the queue. When handling a read failure it could only be normal IO. So when handle_read_error() called freeze_array() the fact that freeze_array only compares ->nr_queued against ->nr_pending was safe. But now that these two types can interleave, we can have both normal and resync IO requests queued, so we need to count them both in nr_pending. This error can lead to freeze_array() hanging if there is a read error, so it is suitable for -stable. Fixes: 79ef3a8aa1cb1523cc231c9a90a278333c21f761 Reported-by: Brassow Jonathan <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-10-05	md/raid1: update next_resync under resync_lock.	NeilBrown
	commit c2fd4c94deedb89ac1746c4a53219be499372c06 upstream. raise_barrier() uses next_resync as part of its calculations, so it really should be updated first, instead of afterwards. next_resync is always used under resync_lock so update it under resync lock to, just before it is used. That is safest. This could cause normal IO and resync IO to interact badly so it suitable for -stable. Fixes: 79ef3a8aa1cb1523cc231c9a90a278333c21f761 Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-10-05	md/raid1: Don't use next_resync to determine how far resync has progressed	NeilBrown
	commit 235549605eb7f1c5a37cef8b09d12e6d412c5cd6 upstream. next_resync is (approximately) the location for the next resync request. However it does not reliably determine the earliest location at which resync might be happening. This is because resync requests can complete out of order, and we only limit the number of current requests, not the distance from the earliest pending request to the latest. mddev->curr_resync_completed is a reliable indicator of the earliest position at which resync could be happening. It is updated less frequently, but is actually reliable which is more important. So use it to determine if a write request is before the region being resynced and so safe from conflict. This error can allow resync IO to interfere with normal IO which could lead to data corruption. Hence: stable. Fixes: 79ef3a8aa1cb1523cc231c9a90a278333c21f761 Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-10-05	md/raid1: make sure resync waits for conflicting writes to complete.	NeilBrown
	commit 2f73d3c55d09ce60647b96ad2a9b539c95a530ee upstream. The resync/recovery process for raid1 was recently changed so that writes could happen in parallel with resync providing they were in different regions of the device. There is a problem though: While a write request will always wait for conflicting resync to complete, a resync request will not always wait for conflicting writes to complete. Two changes are needed to fix this: 1/ raise_barrier (which waits until it is safe to do resync) must wait until current_window_requests is zero 2/ wait_battier (which waits at the start of a new write request) must update current_window_requests if the request could possible conflict with a concurrent resync. As concurrent writes and resync can lead to data loss, this patch is suitable for -stable. Fixes: 79ef3a8aa1cb1523cc231c9a90a278333c21f761 Cc: majianpeng <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-10-05	md/raid1: be more cautious where we read-balance during resync.	NeilBrown
	commit c6d119cf1b5a778e9ed60a006e2a434fcc4471a2 upstream. commit 79ef3a8aa1cb1523cc231c9a90a278333c21f761 made it possible for reads to happen concurrently with resync. This means that we need to be more careful where read_balancing is allowed during resync - we can no longer be sure that any resync that has already started will definitely finish. So keep read_balancing to before recovery_cp, which is conservative but safe. This bug makes it possible to read from a device that doesn't have up-to-date data, so it can cause data corruption. So it is suitable for any kernel since 3.11. Fixes: 79ef3a8aa1cb1523cc231c9a90a278333c21f761 Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-10-05	md/raid1: clean up request counts properly in close_sync()	NeilBrown
	commit 669cc7ba77864e7b1ac39c9f2b2afb8730f341f4 upstream. If there are outstanding writes when close_sync is called, the change to ->start_next_window might cause them to decrement the wrong counter when they complete. Fix this by merging the two counters into the one that will be decremented. Having an incorrect value in a counter can cause raise_barrier() to hangs, so this is suitable for -stable. Fixes: 79ef3a8aa1cb1523cc231c9a90a278333c21f761 Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-10-05	dm crypt: fix access beyond the end of allocated space	Mikulas Patocka
	commit d49ec52ff6ddcda178fc2476a109cf1bd1fa19ed upstream. The DM crypt target accesses memory beyond allocated space resulting in a crash on 32 bit x86 systems. This bug is very old (it dates back to 2.6.25 commit 3a7f6c990ad04 "dm crypt: use async crypto"). However, this bug was masked by the fact that kmalloc rounds the size up to the next power of two. This bug wasn't exposed until 3.17-rc1 commit 298a9fa08a ("dm crypt: use per-bio data"). By switching to using per-bio data there was no longer any padding beyond the end of a dm-crypt allocated memory block. To minimize allocation overhead dm-crypt puts several structures into one block allocated with kmalloc. The block holds struct ablkcipher_request, cipher-specific scratch pad (crypto_ablkcipher_reqsize(any_tfm(cc))), struct dm_crypt_request and an initialization vector. The variable dmreq_start is set to offset of struct dm_crypt_request within this memory block. dm-crypt allocates the block with this size: cc->dmreq_start + sizeof(struct dm_crypt_request) + cc->iv_size. When accessing the initialization vector, dm-crypt uses the function iv_of_dmreq, which performs this calculation: ALIGN((unsigned long)(dmreq + 1), crypto_ablkcipher_alignmask(any_tfm(cc)) + 1). dm-crypt allocated "cc->iv_size" bytes beyond the end of dm_crypt_request structure. However, when dm-crypt accesses the initialization vector, it takes a pointer to the end of dm_crypt_request, aligns it, and then uses it as the initialization vector. If the end of dm_crypt_request is not aligned on a crypto_ablkcipher_alignmask(any_tfm(cc)) boundary the alignment causes the initialization vector to point beyond the allocated space. Fix this bug by calculating the variable iv_size_padding and adding it to the allocated size. Also correct the alignment of dm_crypt_request. struct dm_crypt_request is specific to dm-crypt (it isn't used by the crypto subsystem at all), so it is aligned on __alignof__(struct dm_crypt_request). Also align per_bio_data_size on ARCH_KMALLOC_MINALIGN, so that it is aligned as if the block was allocated with kmalloc. Reported-by: Krzysztof Kolasa <kkolasa@winsoft.pl> Tested-by: Milan Broz <gmazyland@gmail.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-10-05	dm cache: fix race causing dirty blocks to be marked as clean	Anssi Hannula
	commit 40aa978eccec61347cd47b97c598df49acde8be5 upstream. When a writeback or a promotion of a block is completed, the cell of that block is removed from the prison, the block is marked as clean, and the clear_dirty() callback of the cache policy is called. Unfortunately, performing those actions in this order allows an incoming new write bio for that block to come in before clearing the dirty status is completed and therefore possibly causing one of these two scenarios: Scenario A: Thread 1 Thread 2 cell_defer() . - cell removed from prison . - detained bios queued . . incoming write bio . remapped to cache . set_dirty() called, . but block already dirty . => it does nothing clear_dirty() . - block marked clean . - policy clear_dirty() called . Result: Block is marked clean even though it is actually dirty. No writeback will occur. Scenario B: Thread 1 Thread 2 cell_defer() . - cell removed from prison . - detained bios queued . clear_dirty() . - block marked clean . . incoming write bio . remapped to cache . set_dirty() called . - block marked dirty . - policy set_dirty() called - policy clear_dirty() called . Result: Block is properly marked as dirty, but policy thinks it is clean and therefore never asks us to writeback it. This case is visible in "dmsetup status" dirty block count (which normally decreases to 0 on a quiet device). Fix these issues by calling clear_dirty() before calling cell_defer(). Incoming bios for that block will then be detained in the cell and released only after clear_dirty() has completed, so the race will not occur. Found by inspecting the code after noticing spurious dirty counts (scenario B). Signed-off-by: Anssi Hannula <anssi.hannula@iki.fi> Acked-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-09-17	md/raid10: Fix memory leak when raid10 reshape completes.	NeilBrown
	commit b39685526f46976bcd13aa08c82480092befa46c upstream. When a raid10 commences a resync/recovery/reshape it allocates some buffer space. When a resync/recovery completes the buffer space is freed. But not when the reshape completes. This can result in a small memory leak. There is a subtle side-effect of this bug. When a RAID10 is reshaped to a larger array (more devices), the reshape is immediately followed by a "resync" of the new space. This "resync" will use the buffer space which was allocated for "reshape". This can cause problems including a "BUG" in the SCSI layer. So this is suitable for -stable. Fixes: 3ea7daa5d7fde47cd41f4d56c2deb949114da9d6 Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-09-17	md/raid10: fix memory leak when reshaping a RAID10.	NeilBrown
	commit ce0b0a46955d1bb389684a2605dbcaa990ba0154 upstream. raid10 reshape clears unwanted bits from a bio->bi_flags using a method which, while clumsy, worked until 3.10 when BIO_OWNS_VEC was added. Since then it clears that bit but shouldn't. This results in a memory leak. So change to used the approved method of clearing unwanted bits. As this causes a memory leak which can consume all of memory the fix is suitable for -stable. Fixes: a38352e0ac02dbbd4fa464dc22d1352b5fbd06fd Reported-by: mdraid.pkoch@dfgh.net (Peter Koch) Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-09-17	md/raid6: avoid data corruption during recovery of double-degraded RAID6	NeilBrown
	commit 9c4bdf697c39805078392d5ddbbba5ae5680e0dd upstream. During recovery of a double-degraded RAID6 it is possible for some blocks not to be recovered properly, leading to corruption. If a write happens to one block in a stripe that would be written to a missing device, and at the same time that stripe is recovering data to the other missing device, then that recovered data may not be written. This patch skips, in the double-degraded case, an optimisation that is only safe for single-degraded arrays. Bug was introduced in 2.6.32 and fix is suitable for any kernel since then. In an older kernel with separate handle_stripe5() and handle_stripe6() functions the patch must change handle_stripe6(). Fixes: 6c0069c0ae9659e3a91b68eaed06a5c6c37f45c8 Cc: Yuri Tikhonov <yur@emcraft.com> Cc: Dan Williams <dan.j.williams@intel.com> Reported-by: "Manibalan P" <pmanibalan@amiindia.co.in> Tested-by: "Manibalan P" <pmanibalan@amiindia.co.in> Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1090423 Signed-off-by: NeilBrown <neilb@suse.de> Acked-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-09-17	md/raid1,raid10: always abort recover on write error.	NeilBrown
	commit 2446dba03f9dabe0b477a126cbeb377854785b47 upstream. Currently we don't abort recovery on a write error if the write error to the recovering device was triggerd by normal IO (as opposed to recovery IO). This means that for one bitmap region, the recovery might write to the recovering device for a few sectors, then not bother for subsequent sectors (as it never writes to failed devices). In this case the bitmap bit will be cleared, but it really shouldn't. The result is that if the recovering device fails and is then re-added (after fixing whatever hardware problem triggerred the failure), the second recovery won't redo the region it was in the middle of, so some of the device will not be recovered properly. If we abort the recovery, the region being processes will be cancelled (bit not cleared) and the whole region will be retried. As the bug can result in data corruption the patch is suitable for -stable. For kernels prior to 3.11 there is a conflict in raid10.c which will require care. Original-from: jiao hui <jiaohui@bwstor.com.cn> Reported-and-tested-by: jiao hui <jiaohui@bwstor.com.cn> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-08-07	dm cache: fix race affecting dirty block count	Anssi Hannula
	commit 44fa816bb778edbab6b6ddaaf24908dd6295937e upstream. nr_dirty is updated without locking, causing it to drift so that it is non-zero (either a small positive integer, or a very large one when an underflow occurs) even when there are no actual dirty blocks. This was due to a race between the workqueue and map function accessing nr_dirty in parallel without proper protection. People were seeing under runs due to a race on increment/decrement of nr_dirty, see: https://lkml.org/lkml/2014/6/3/648 Fix this by using an atomic_t for nr_dirty. Reported-by: roma1390@gmail.com Signed-off-by: Anssi Hannula <anssi.hannula@iki.fi> Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-08-07	dm bufio: fully initialize shrinker	Greg Thelen
	commit d8c712ea471ce7a4fd1734ad2211adf8469ddddc upstream. 1d3d4437eae1 ("vmscan: per-node deferred work") added a flags field to struct shrinker assuming that all shrinkers were zero filled. The dm bufio shrinker is not zero filled, which leaves arbitrary kmalloc() data in flags. So far the only defined flags bit is SHRINKER_NUMA_AWARE. But there are proposed patches which add other bits to shrinker.flags (e.g. memcg awareness). Rather than simply initializing the shrinker, this patch uses kzalloc() when allocating the dm_bufio_client to ensure that the embedded shrinker and any other similar structures are zeroed. This fixes theoretical over aggressive shrinking of dm bufio objects. If the uninitialized dm_bufio_client.shrinker.flags contains SHRINKER_NUMA_AWARE then shrink_slab() would call the dm shrinker for each numa node rather than just once. This has been broken since 3.12. Signed-off-by: Greg Thelen <gthelen@google.com> Acked-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-07-28	dm cache metadata: do not allow the data block size to change	Mike Snitzer
	commit 048e5a07f282c57815b3901d4a68a77fa131ce0a upstream. The block size for the dm-cache's data device must remained fixed for the life of the cache. Disallow any attempt to change the cache's data block size. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-07-28	dm thin metadata: do not allow the data block size to change	Mike Snitzer
	commit 9aec8629ec829fc9403788cd959e05dd87988bd1 upstream. The block size for the thin-pool's data device must remained fixed for the life of the thin-pool. Disallow any attempt to change the thin-pool's data block size. It should be noted that attempting to change the data block size via thin-pool table reload will be ignored as a side-effect of the thin-pool handover that the thin-pool target does during thin-pool table reload. Here is an example outcome of attempting to load a thin-pool table that reduced the thin-pool's data block size from 1024K to 512K. Before: kernel: device-mapper: thin: 253:4: growing the data device from 204800 to 409600 blocks After: kernel: device-mapper: thin metadata: changing the data block size (from 2048 to 1024) is not supported kernel: device-mapper: table: 253:4: thin-pool: Error creating metadata object kernel: device-mapper: ioctl: error adding target to table Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-07-17	dm: allocate a special workqueue for deferred device removal	Mikulas Patocka
	commit acfe0ad74d2e1bfc81d1d7bf5e15b043985d3650 upstream. The commit 2c140a246dc ("dm: allow remove to be deferred") introduced a deferred removal feature for the device mapper. When this feature is used (by passing a flag DM_DEFERRED_REMOVE to DM_DEV_REMOVE_CMD ioctl) and the user tries to remove a device that is currently in use, the device will be removed automatically in the future when the last user closes it. Device mapper used the system workqueue to perform deferred removals. However, some targets (dm-raid1, dm-mpath, dm-stripe) flush work items scheduled for the system workqueue from their destructor. If the destructor itself is called from the system workqueue during deferred removal, it introduces a possible deadlock - the workqueue tries to flush itself. Fix this possible deadlock by introducing a new workqueue for deferred removals. We allocate just one workqueue for all dm targets. The ability of dm targets to process IOs isn't dependent on deferred removal of unused targets, so a deadlock due to shared workqueue isn't possible. Also, cleanup local_init() to eliminate potential for returning success on failure. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-07-17	dm io: fix a race condition in the wake up code for sync_io	Joe Thornber
	commit 10f1d5d111e8aed46a0f1179faf9a3cf422f689e upstream. There's a race condition between the atomic_dec_and_test(&io->count) in dec_count() and the waking of the sync_io() thread. If the thread is spuriously woken immediately after the decrement it may exit, making the on stack io struct invalid, yet the dec_count could still be using it. Fix this race by using a completion in sync_io() and dec_count(). Reported-by: Minfei Huang <huangminfei@ucloud.cn> Signed-off-by: Joe Thornber <thornber@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-07-09	md: flush writes before starting a recovery.	NeilBrown
	commit 133d4527eab8d199a62eee6bd433f0776842df2e upstream. When we write to a degraded array which has a bitmap, we make sure the relevant bit in the bitmap remains set when the write completes (so a 're-add' can quickly rebuilt a temporarily-missing device). If, immediately after such a write starts, we incorporate a spare, commence recovery, and skip over the region where the write is happening (because the 'needs recovery' flag isn't set yet), then that write will not get to the new device. Once the recovery finishes the new device will be trusted, but will have incorrect data, leading to possible corruption. We cannot set the 'needs recovery' flag when we start the write as we do not know easily if the write will be "degraded" or not. That depends on details of the particular raid level and particular write request. This patch fixes a corruption issue of long standing and so it suitable for any -stable kernel. It applied correctly to 3.0 at least and will minor editing to earlier kernels. Reported-by: Bill <billstuff2001@sbcglobal.net> Tested-by: Bill <billstuff2001@sbcglobal.net> Link: http://lkml.kernel.org/r/53A518BB.60709@sbcglobal.net Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-07-09	dm thin: update discard_granularity to reflect the thin-pool blocksize	Lukas Czerner
	commit 09869de57ed2728ae3c619803932a86cb0e2c4f8 upstream. DM thinp already checks whether the discard_granularity of the data device is a factor of the thin-pool block size. But when using the dm-thin-pool's discard passdown support, DM thinp was not selecting the max of the underlying data device's discard_granularity and the thin-pool's block size. Update set_discard_limits() to set discard_granularity to the max of these values. This enables blkdev_issue_discard() to properly align the discards that are sent to the DM thin device on a full block boundary. As such each discard will now cover an entire DM thin-pool block and the block will be reclaimed. Reported-by: Zdenek Kabelac <zkabelac@redhat.com> Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-06-11	md: always set MD_RECOVERY_INTR when interrupting a reshape thread.	NeilBrown
	commit 2ac295a544dcae9299cba13ce250419117ae7fd1 upstream. Commit 8313b8e57f55b15e5b7f7fc5d1630bbf686a9a97 md: fix problem when adding device to read-only array with bitmap. added a called to md_reap_sync_thread() which cause a reshape thread to be interrupted (in particular, it could cause md_thread() to never even call md_do_sync()). However it didn't set MD_RECOVERY_INTR so ->finish_reshape() would not know that the reshape didn't complete. This only happens when mddev->ro is set and normally reshape threads don't run in that situation. But raid5 and raid10 can start a reshape thread during "run" is the array is in the middle of a reshape. They do this even if ->ro is set. So it is best to set MD_RECOVERY_INTR before abortingg the sync thread, just in case. Though it rare for this to trigger a problem it can cause data corruption because the reshape isn't finished properly. So it is suitable for any stable which the offending commit was applied to. (3.2 or later) Fixes: 8313b8e57f55b15e5b7f7fc5d1630bbf686a9a97 Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-06-11	md: always set MD_RECOVERY_INTR when aborting a reshape or other "resync".	NeilBrown
	commit 3991b31ea072b070081ca3bfa860a077eda67de5 upstream. If mddev->ro is set, md_to_sync will (correctly) abort. However in that case MD_RECOVERY_INTR isn't set. If a RESHAPE had been requested, then ->finish_reshape() will be called and it will think the reshape was successful even though nothing happened. Normally a resync will not be requested if ->ro is set, but if an array is stopped while a reshape is on-going, then when the array is started, the reshape will be restarted. If the array is also set read-only at this point, the reshape will instantly appear to success, resulting in data corruption. Consequently, this patch is suitable for any -stable kernel. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-06-11	dm cache: always split discards on cache block boundaries	Heinz Mauelshagen
	commit f1daa838e861ae1a0fb7cd9721a21258430fcc8c upstream. The DM cache target cannot cope with discards that span multiple cache blocks, so each discard bio that spans more than one cache block must get split by the DM core. Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-06-11	dm thin: add 'no_space_timeout' dm-thin-pool module param	Mike Snitzer
	commit 80c578930ce77ba8bcfb226a184b482020bdda7b upstream. Commit 85ad643b ("dm thin: add timeout to stop out-of-data-space mode holding IO forever") introduced a fixed 60 second timeout. Users may want to either disable or modify this timeout. Allow the out-of-data-space timeout to be configured using the 'no_space_timeout' dm-thin-pool module param. Setting it to 0 will disable the timeout, resulting in IO being queued until more data space is added to the thin-pool. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-06-07	dm thin: add timeout to stop out-of-data-space mode holding IO forever	Joe Thornber
	commit 85ad643b7e7e52d37620fb272a9fd577a8095647 upstream. If the pool runs out of data space, dm-thin can be configured to either error IOs that would trigger provisioning, or hold those IOs until the pool is resized. Unfortunately, holding IOs until the pool is resized can result in a cascade of tasks hitting the hung_task_timeout, which may render the system unavailable. Add a fixed timeout so IOs can only be held for a maximum of 60 seconds. If LVM is going to resize a thin-pool that is out of data space it needs to be prompt about it. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-06-07	dm thin: allow metadata commit if pool is in PM_OUT_OF_DATA_SPACE mode	Joe Thornber
	commit 8d07e8a5f5bc7b90f755d9b427ea930024f4c986 upstream. Commit 3e1a0699 ("dm thin: fix out of data space handling") introduced a regression in the metadata commit() method by returning an error if the pool is in PM_OUT_OF_DATA_SPACE mode. This oversight caused a thin device to return errors even if the default queue_if_no_space ENOSPC handling mode is used. Fix commit() to only fail if pool is in PM_READ_ONLY or PM_FAIL mode. Reported-by: qindehua@163.com Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-06-07	dm crypt: fix cpu hotplug crash by removing per-cpu structure	Mikulas Patocka
	commit 610f2de3559c383caf8fbbf91e9968102dff7ca0 upstream. The DM crypt target used per-cpu structures to hold pointers to a ablkcipher_request structure. The code assumed that the work item keeps executing on a single CPU, so it didn't use synchronization when accessing this structure. If a CPU is disabled by writing 0 to /sys/devices/system/cpu/cpu*/online, the work item could be moved to another CPU. This causes dm-crypt crashes, like the following, because the code starts using an incorrect ablkcipher_request: smpboot: CPU 7 is now offline BUG: unable to handle kernel NULL pointer dereference at 0000000000000130 IP: [<ffffffffa1862b3d>] crypt_convert+0x12d/0x3c0 [dm_crypt] ... Call Trace: [<ffffffffa1864415>] ? kcryptd_crypt+0x305/0x470 [dm_crypt] [<ffffffff81062060>] ? finish_task_switch+0x40/0xc0 [<ffffffff81052a28>] ? process_one_work+0x168/0x470 [<ffffffff8105366b>] ? worker_thread+0x10b/0x390 [<ffffffff81053560>] ? manage_workers.isra.26+0x290/0x290 [<ffffffff81058d9f>] ? kthread+0xaf/0xc0 [<ffffffff81058cf0>] ? kthread_create_on_node+0x120/0x120 [<ffffffff813464ac>] ? ret_from_fork+0x7c/0xb0 [<ffffffff81058cf0>] ? kthread_create_on_node+0x120/0x120 Fix this bug by removing the per-cpu definition. The structure ablkcipher_request is accessed via a pointer from convert_context. Consequently, if the work item is rescheduled to a different CPU, the thread still uses the same ablkcipher_request. This change may undermine performance improvements intended by commit c0297721 ("dm crypt: scale to multiple cpus") on select hardware. In practice no performance difference was observed on recent hardware. But regardless, correctness is more important than performance. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-06-07	md: avoid possible spinning md thread at shutdown.	NeilBrown
	commit 0f62fb220aa4ebabe8547d3a9ce4a16d3c045f21 upstream. If an md array with externally managed metadata (e.g. DDF or IMSM) is in use, then we should not set safemode==2 at shutdown because: 1/ this is ineffective: user-space need to be involved in any 'safemode' handling, 2/ The safemode management code doesn't cope with safemode==2 on external metadata and md_check_recover enters an infinite loop. Even at shutdown, an infinite-looping process can be problematic, so this could cause shutdown to hang. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-06-07	md/raid10: call wait_barrier() for each request submitted.	NeilBrown
	commit cc13b1d1500656a20e41960668f3392dda9fa6e2 upstream. wait_barrier() includes a counter, so we must call it precisely once (unless balanced by allow_barrier()) for each request submitted. Since commit 20d0189b1012a37d2533a87fb451f7852f2418d1 block: Introduce new bio_split() in 3.14-rc1, we don't call it for the extra requests generated when we need to split a bio. When this happens the counter goes negative, any resync/recovery will never start, and "mdadm --stop" will hang. Reported-by: Chris Murphy <lists@colorremedies.com> Fixes: 20d0189b1012a37d2533a87fb451f7852f2418d1 Cc: Kent Overstreet <kmo@daterainc.com> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-06-07	dm cache: fix writethrough mode quiescing in cache_map	Mike Snitzer
	commit 131cd131a9ff63d4b84f3fe15073a2984ac30066 upstream. Commit 2ee57d58735 ("dm cache: add passthrough mode") inadvertently removed the deferred set reference that was taken in cache_map()'s writethrough mode support. Restore taking this reference. This issue was found with code inspection. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-06-07	dm verity: fix biovecs hash calculation regression	Milan Broz
	commit 3a7745215e7f73a5c7d9bcdc50661a55b39052a3 upstream. Commit 003b5c5719f159f4f4bf97511c4702a0638313dd ("block: Convert drivers to immutable biovecs") incorrectly converted biovec iteration in dm-verity to always calculate the hash from a full biovec, but the function only needs to calculate the hash from part of the biovec (up to the calculated "todo" value). Fix this issue by limiting hash input to only the requested data size. This problem was identified using the cryptsetup regression test for veritysetup (verity-compat-test). Signed-off-by: Milan Broz <gmazyland@gmail.com> Acked-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-05-31	md/raid1: r1buf_pool_alloc: free allocate pages when subsequent allocation ↵	NeilBrown
	fails. commit da1aab3dca9aa88ae34ca392470b8943159e25fe upstream. When performing a user-request check/repair (MD_RECOVERY_REQUEST is set) on a raid1, we allocate multiple bios each with their own set of pages. If the page allocations for one bio fails, we currently do not free the pages allocated for the previous bios, nor do we free the bio itself. This patch frees all the already-allocate pages, and makes sure that all the bios are freed as well. This bug can cause a memory leak which can ultimately OOM a machine. It was introduced in 3.10-rc1. Fixes: a07876064a0b73ab5ef1ebcf14b1cf0231c07858 Cc: Kent Overstreet <koverstreet@google.com> Reported-by: Russell King - ARM Linux <linux@arm.linux.org.uk> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-05-13	dm cache: fix a lock-inversion	Joe Thornber
	commit 0596661f0a16d9d69bf1033320e70b6ff52b5e81 upstream. When suspending a cache the policy is walked and the individual policy hints written to the metadata via sync_metadata(). This led to this lock order: policy->lock cache_metadata->root_lock When loading the cache target the policy is populated while the metadata lock is held: cache_metadata->root_lock policy->lock Fix this potential lock-inversion (ABBA) deadlock in sync_metadata() by ensuring the cache_metadata root_lock is held whilst all the hints are written, rather than being repeatedly locked while policy->lock is held (as was the case with each callout that policy_walk_mappings() made to the old save_hint() method). Found by turning on the CONFIG_PROVE_LOCKING ("Lock debugging: prove locking correctness") build option. However, it is not clear how the LOCKDEP reported paths can lead to a deadlock since the two paths, suspending a target and loading a target, never occur at the same time. But that doesn't mean the same lock-inversion couldn't have occurred elsewhere. Reported-by: Marian Csontos <mcsontos@redhat.com> Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-05-13	dm thin: fix dangling bio in process_deferred_bios error path	Mike Snitzer
	commit fe76cd88e654124d1431bb662a0fc6e99ca811a5 upstream. If unable to ensure_next_mapping() we must add the current bio, which was removed from the @bios list via bio_list_pop, back to the deferred_bios list before all the remaining @bios. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-05-13	dm: take care to copy the space map roots before locking the superblock	Joe Thornber
	commit 5a32083d03fb543f63489b2946c4948398579ba0 upstream. In theory copying the space map root can fail, but in practice it never does because we're careful to check what size buffer is needed. But make certain we're able to copy the space map roots before locking the superblock. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-05-13	dm transaction manager: fix corruption due to non-atomic transaction commit	Joe Thornber
	commit a9d45396f5956d0b615c7ae3b936afd888351a47 upstream. The persistent-data library used by dm-thin, dm-cache, etc is transactional. If anything goes wrong, such as an io error when writing new metadata or a power failure, then we roll back to the last transaction. Atomicity when committing a transaction is achieved by: a) Never overwriting data from the previous transaction. b) Writing the superblock last, after all other metadata has hit the disk. This commit and the following commit ("dm: take care to copy the space map roots before locking the superblock") fix a bug associated with (b). When committing it was possible for the superblock to still be written in spite of an io error occurring during the preceeding metadata flush. With these commits we're careful not to take the write lock out on the superblock until after the metadata flush has completed. Change the transaction manager's semantics for dm_tm_commit() to assume all data has been flushed _before_ the single superblock that is passed in. As a prerequisite, split the block manager's block unlocking and flushing by simplifying dm_bm_flush_and_unlock() to dm_bm_flush(). Now the unlocking must be done separately. This issue was discovered by forcing io errors at the crucial time using dm-flakey. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-05-13	dm cache: prevent corruption caused by discard_block_size > cache_block_size	Mike Snitzer
	commit d132cc6d9e92424bb9d4fd35f5bd0e55d583f4be upstream. If the discard block size is larger than the cache block size we will not properly quiesce IO to a region that is about to be discarded. This results in a race between a cache migration where no copy is needed, and a write to an adjacent cache block that's within the same large discard block. Workaround this by limiting the discard_block_size to cache_block_size. Also limit the max_discard_sectors to cache_block_size. A more comprehensive fix that introduces range locking support in the bio_prison and proper quiescing of a discard range that spans multiple cache blocks is already in development. Reported-by: Morgan Mears <Morgan.Mears@netapp.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com> Acked-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-03-12	dm cache: fix access beyond end of origin device	Heinz Mauelshagen
	In order to avoid wasting cache space a partial block at the end of the origin device is not cached. Unfortunately, the check for such a partial block at the end of the origin device was flawed. Fix accesses beyond the end of the origin device that occured due to attempted promotion of an undetected partial block by: - initializing the per bio data struct to allow cache_end_io to work properly - recognizing access to the partial block at the end of the origin device - avoiding out of bounds access to the discard bitset Otherwise, users can experience errors like the following: attempt to access beyond end of device dm-5: rw=0, want=20971520, limit=20971456 ... device-mapper: cache: promotion failed; couldn't copy block Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org
2014-03-12	dm cache: fix truncation bug when copying a block to/from >2TB fast device	Heinz Mauelshagen
	During demotion or promotion to a cache's >2TB fast device we must not truncate the cache block's associated sector to 32bits. The 32bit temporary result of from_cblock() caused a 32bit multiplication when calculating the sector of the fast device in issue_copy_real(). Use an intermediate 64bit type to store the 32bit from_cblock() to allow for proper 64bit multiplication. Here is an example of how this bug manifests on an ext4 filesystem: EXT4-fs error (device dm-0): ext4_mb_generate_buddy:756: group 17136, 32768 clusters in bitmap, 30688 in gd; block bitmap corrupt. JBD2: Spotted dirty metadata buffer (dev = dm-0, blocknr = 0). There's a risk of filesystem corruption in case of system crash. Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org
2014-03-07	dm space map metadata: fix refcount decrement below 0 which caused corruption	Joe Thornber
	This has been a relatively long-standing issue that wasn't nailed down until Teng-Feng Yang's meticulous bug report to dm-devel on 3/7/2014, see: http://www.redhat.com/archives/dm-devel/2014-March/msg00021.html From that report: "When decreasing the reference count of a metadata block with its reference count equals 3, we will call dm_btree_remove() to remove this enrty from the B+tree which keeps the reference count info in metadata device. The B+tree will try to rebalance the entry of the child nodes in each node it traversed, and the rebalance process contains the following steps. (1) Finding the corresponding children in current node (shadow_current(s)) (2) Shadow the children block (issue BOP_INC) (3) redistribute keys among children, and free children if necessary (issue BOP_DEC) Since the update of a metadata block's reference count could be recursive, we will stash these reference count update operations in smm->uncommitted and then process them in a FILO fashion. The problem is that step(3) could free the children which is created in step(2), so the BOP_DEC issued in step(3) will be carried out before the BOP_INC issued in step(2) since these BOPs will be processed in FILO fashion. Once the BOP_DEC from step(3) tries to decrease the reference count of newly shadow block, it will report failure for its reference equals 0 before decreasing. It looks like we can solve this issue by processing these BOPs in a FIFO fashion instead of FILO." Commit 5b564d80 ("dm space map: disallow decrementing a reference count below zero") changed the code to report an error for this temporary refcount decrement below zero. So what was previously a harmless invalid refcount became a hard failure due to the new error path: device-mapper: space map common: unable to decrement a reference count below 0 device-mapper: thin: 253:6: dm_thin_insert_block() failed: error = -22 device-mapper: thin: 253:6: switching pool to read-only mode This bug is in dm persistent-data code that is common to the DM thin and cache targets. So any users of those targets should apply this fix. Fix this by applying recursive space map operations in FIFO order rather than FILO. Resolves: https://bugzilla.kernel.org/show_bug.cgi?id=68801 Reported-by: Apollon Oikonomopoulos <apoikos@debian.org> Reported-by: edwillam1007@gmail.com Reported-by: Teng-Feng Yang <shinrairis@gmail.com> Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org # 3.13+
2014-03-05	dm thin: fix noflush suspend IO queueing	Joe Thornber
	i) by the time DM core calls the postsuspend hook the dm_noflush flag has been cleared. So the old thin_postsuspend did nothing. We need to use the presuspend hook instead. ii) There was a race between bios leaving DM core and arriving in the deferred queue. thin_presuspend now sets a 'requeue' flag causing all bios destined for that thin to be requeued back to DM core. Then it requeues all held IO, and all IO on the deferred queue (destined for that thin). Finally postsuspend clears the 'requeue' flag. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-03-05	dm thin: fix deadlock in __requeue_bio_list	Joe Thornber
	The spin lock in requeue_io() was held for too long, allowing deadlock. Don't worry, due to other issues addressed in the following "dm thin: fix noflush suspend IO queueing" commit, this code was never called. Fix this by taking the spin lock for a much shorter period of time. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-03-05	dm thin: fix out of data space handling	Joe Thornber
	Ideally a thin pool would never run out of data space; the low water mark would trigger userland to extend the pool before we completely run out of space. However, many small random IOs to unprovisioned space can consume data space at an alarming rate. Adjust your low water mark if you're frequently seeing "out-of-data-space" mode. Before this fix, if data space ran out the pool would be put in PM_READ_ONLY mode which also aborted the pool's current metadata transaction (data loss for any changes in the transaction). This had a side-effect of needlessly compromising data consistency. And retry of queued unserviceable bios, once the data pool was resized, could initiate changes to potentially inconsistent pool metadata. Now when the pool's data space is exhausted transition to a new pool mode (PM_OUT_OF_DATA_SPACE) that allows metadata to be changed but data may not be allocated. This allows users to remove thin volumes or discard data to recover data space. The pool is no longer put in PM_READ_ONLY mode in response to the pool running out of data space. And PM_READ_ONLY mode no longer aborts the pool's current metadata transaction. Also, set_pool_mode() will now notify userspace when the pool mode is changed. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2014-03-05	dm thin: ensure user takes action to validate data and metadata consistency	Mike Snitzer
	If a thin metadata operation fails the current transaction will abort, whereby causing potential for IO layers up the stack (e.g. filesystems) to have data loss. As such, set THIN_METADATA_NEEDS_CHECK_FLAG in the thin metadata's superblock which: 1) requires the user verify the thin metadata is consistent (e.g. use thin_check, etc) 2) suggests the user verify the thin data is consistent (e.g. use fsck) The only way to clear the superblock's THIN_METADATA_NEEDS_CHECK_FLAG is to run thin_repair. On metadata operation failure: abort current metadata transaction, set pool in read-only mode, and now set the needs_check flag. As part of this change, constraints are introduced or relaxed: * don't allow a pool to transition to write mode if needs_check is set * don't allow data or metadata space to be resized if needs_check is set * if a thin pool's metadata space is exhausted: the kernel will now force the user to take the pool offline for repair before the kernel will allow the metadata space to be extended. Also, update Documentation to include information about when the thin provisioning target commits metadata, how it handles metadata failures and running out of space. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Joe Thornber <ejt@redhat.com>