aboutsummaryrefslogtreecommitdiffstats
path: root/drivers/nvdimm
AgeCommit message (Collapse)Author
2021-03-03libnvdimm/dimm: Avoid race between probe and available_slots_show()Dan Williams
commit 7018c897c2f243d4b5f1b94bc6b4831a7eab80fb upstream Richard reports that the following test: (while true; do cat /sys/bus/nd/devices/nmem*/available_slots 2>&1 > /dev/null done) & while true; do for i in $(seq 0 4); do echo nmem$i > /sys/bus/nd/drivers/nvdimm/bind done for i in $(seq 0 4); do echo nmem$i > /sys/bus/nd/drivers/nvdimm/unbind done done ...fails with a crash signature like: divide error: 0000 [#1] SMP KASAN PTI RIP: 0010:nd_label_nfree+0x134/0x1a0 [libnvdimm] [..] Call Trace: available_slots_show+0x4e/0x120 [libnvdimm] dev_attr_show+0x42/0x80 ? memset+0x20/0x40 sysfs_kf_seq_show+0x218/0x410 The root cause is that available_slots_show() consults driver-data, but fails to synchronize against device-unbind setting up a TOCTOU race to access uninitialized memory. Validate driver-data under the device-lock. Fixes: 4d88a97aa9e8 ("libnvdimm, nvdimm: dimm driver and base libnvdimm device-driver infrastructure") Cc: <stable@vger.kernel.org> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Coly Li <colyli@suse.com> Reported-by: Richard Palethorpe <rpalethorpe@suse.com> Acked-by: Richard Palethorpe <rpalethorpe@suse.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> [sudip: use device_lock()] Signed-off-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-12-29libnvdimm/namespace: Fix reaping of invalidated block-window-namespace labelsDan Williams
commit 2dd2a1740ee19cd2636d247276cf27bfa434b0e2 upstream. A recent change to ndctl to attempt to reconfigure namespaces in place uncovered a label accounting problem in block-window-type namespaces. The ndctl "create.sh" test is able to trigger this signature: WARNING: CPU: 34 PID: 9167 at drivers/nvdimm/label.c:1100 __blk_label_update+0x9a3/0xbc0 [libnvdimm] [..] RIP: 0010:__blk_label_update+0x9a3/0xbc0 [libnvdimm] [..] Call Trace: uuid_store+0x21b/0x2f0 [libnvdimm] kernfs_fop_write+0xcf/0x1c0 vfs_write+0xcc/0x380 ksys_write+0x68/0xe0 When allocated capacity for a namespace is renamed (new UUID) the labels with the old UUID need to be deleted. The ndctl behavior to always destroy namespaces on reconfiguration hid this problem. The immediate impact of this bug is limited since block-window-type namespaces only seem to exist in the specification and not in any shipping products. However, the label handling code is being reused for other technologies like CXL region labels, so there is a benefit to making sure both vertical labels sets (block-window) and horizontal label sets (pmem) have a functional reference implementation in libnvdimm. Fixes: c4703ce11c23 ("libnvdimm/namespace: Fix label tracking error") Cc: <stable@vger.kernel.org> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Ira Weiny <ira.weiny@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-09-09block: Move SECTOR_SIZE and SECTOR_SHIFT definitions into <linux/blkdev.h>Bart Van Assche
commit 233bde21aa43516baa013ef7ac33f3427056db3e upstream. It happens often while I'm preparing a patch for a block driver that I'm wondering: is a definition of SECTOR_SIZE and/or SECTOR_SHIFT available for this driver? Do I have to introduce definitions of these constants before I can use these constants? To avoid this confusion, move the existing definitions of SECTOR_SIZE and SECTOR_SHIFT into the <linux/blkdev.h> header file such that these become available for all block drivers. Make the SECTOR_SIZE definition in the uapi msdos_fs.h header file conditional to avoid that including that header file after <linux/blkdev.h> causes the compiler to complain about a SECTOR_SIZE redefinition. Note: the SECTOR_SIZE / SECTOR_SHIFT / SECTOR_BITS definitions have not been removed from uapi header files nor from NAND drivers in which these constants are used for another purpose than converting block layer offsets and sizes into a number of sectors. Cc: David S. Miller <davem@davemloft.net> Cc: Mike Snitzer <snitzer@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-06-11libnvdimm: Fix endian conversion issues Aneesh Kumar K.V
commit 86aa66687442ef45909ff9814b82b4d2bb892294 upstream. nd_label->dpa issue was observed when trying to enable the namespace created with little-endian kernel on a big-endian kernel. That made me run `sparse` on the rest of the code and other changes are the result of that. Fixes: d9b83c756953 ("libnvdimm, btt: rework error clearing") Fixes: 9dedc73a4658 ("libnvdimm/btt: Fix LBA masking during 'free list' population") Reviewed-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Link: https://lore.kernel.org/r/20190809074726.27815-1-aneesh.kumar@linux.ibm.com Signed-off-by: Dan Williams <dan.j.williams@intel.com> Cc: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-05-27libnvdimm/btt: Fix LBA masking during 'free list' populationVishal Verma
[ Upstream commit 9dedc73a4658ebcc0c9b58c3cb84e9ac80122213 ] The Linux BTT implementation assumes that log entries will never have the 'zero' flag set, and indeed it never sets that flag for log entries itself. However, the UEFI spec is ambiguous on the exact format of the LBA field of a log entry, specifically as to whether it should include the additional flag bits or not. While a zero bit doesn't make sense in the context of a log entry, other BTT implementations might still have it set. If an implementation does happen to have it set, we would happily read it in as the next block to write to for writes. Since a high bit is set, it pushes the block number out of the range of an 'arena', and we fail such a write with an EIO. Follow the robustness principle, and tolerate such implementations by stripping out the zero flag when populating the free list during initialization. Additionally, use the same stripped out entries for detection of incomplete writes and map restoration that happens at this stage. Add a sysfs file 'log_zero_flags' that indicates the ability to accept such a layout to userspace applications. This enables 'ndctl check-namespace' to recognize whether the kernel is able to handle zero flags, or whether it should attempt a fix-up under the --repair option. Cc: Dan Williams <dan.j.williams@intel.com> Reported-by: Dexuan Cui <decui@microsoft.com> Reported-by: Pedro d'Aquino Filocre F S Barbuda <pbarbuda@microsoft.com> Tested-by: Dexuan Cui <decui@microsoft.com> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-05-27libnvdimm/btt: Remove unnecessary code in btt_freelist_initVishal Verma
[ Upstream commit 2f8c9011151337d0bc106693f272f9bddbccfab2 ] We call btt_log_read() twice, once to get the 'old' log entry, and again to get the 'new' entry. However, we have no use for the 'old' entry, so remove it. Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-04-24libnvdimm: Out of bounds read in __nd_ioctl()Dan Carpenter
[ Upstream commit f84afbdd3a9e5e10633695677b95422572f920dc ] The "cmd" comes from the user and it can be up to 255. It it's more than the number of bits in long, it results out of bounds read when we check test_bit(cmd, &cmd_mask). The highest valid value for "cmd" is ND_CMD_CALL (10) so I added a compare against that. Fixes: 62232e45f4a2 ("libnvdimm: control (ioctl) messages for nvdimm_bus and nvdimm devices") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Link: https://lore.kernel.org/r/20200225162055.amtosfy7m35aivxg@kili.mountain Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-01-04libnvdimm/btt: fix variable 'rc' set but not usedQian Cai
[ Upstream commit 4e24e37d5313edca8b4ab86f240c046c731e28d6 ] drivers/nvdimm/btt.c: In function 'btt_read_pg': drivers/nvdimm/btt.c:1264:8: warning: variable 'rc' set but not used [-Wunused-but-set-variable] int rc; ^~ Add a ratelimited message in case a storm of errors is encountered. Fixes: d9b83c756953 ("libnvdimm, btt: rework error clearing") Signed-off-by: Qian Cai <cai@lca.pw> Reviewed-by: Vishal Verma <vishal.l.verma@intel.com> Link: https://lore.kernel.org/r/1572530719-32161-1-git-send-email-cai@lca.pw Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-07-31libnvdimm/pfn: fix fsdax-mode namespace info-block zero-fieldsDan Williams
commit 7e3e888dfc138089f4c15a81b418e88f0978f744 upstream. At namespace creation time there is the potential for the "expected to be zero" fields of a 'pfn' info-block to be filled with indeterminate data. While the kernel buffer is zeroed on allocation it is immediately overwritten by nd_pfn_validate() filling it with the current contents of the on-media info-block location. For fields like, 'flags' and the 'padding' it potentially means that future implementations can not rely on those fields being zero. In preparation to stop using the 'start_pad' and 'end_trunc' fields for section alignment, arrange for fields that are not explicitly initialized to be guaranteed zero. Bump the minor version to indicate it is safe to assume the 'padding' and 'flags' are zero. Otherwise, this corruption is expected to benign since all other critical fields are explicitly initialized. Note The cc: stable is about spreading this new policy to as many kernels as possible not fixing an issue in those kernels. It is not until the change titled "libnvdimm/pfn: Stop padding pmem namespaces to section alignment" where this improper initialization becomes a problem. So if someone decides to backport "libnvdimm/pfn: Stop padding pmem namespaces to section alignment" (which is not tagged for stable), make sure this pre-requisite is flagged. Link: http://lkml.kernel.org/r/156092356065.979959.6681003754765958296.stgit@dwillia2-desk3.amr.corp.intel.com Fixes: 32ab0a3f5170 ("libnvdimm, pmem: 'struct page' for pmem") Signed-off-by: Dan Williams <dan.j.williams@intel.com> Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> [ppc64] Cc: <stable@vger.kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Jane Chu <jane.chu@oracle.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Toshi Kani <toshi.kani@hpe.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richardw.yang@linux.intel.com> Cc: Jason Gunthorpe <jgg@mellanox.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-19libnvdimm: Fix compilation warnings with W=1Qian Cai
[ Upstream commit c01dafad77fea8d64c4fdca0a6031c980842ad65 ] Several places (dimm_devs.c, core.c etc) include label.h but only label.c uses NSINDEX_SIGNATURE, so move its definition to label.c instead. In file included from drivers/nvdimm/dimm_devs.c:23: drivers/nvdimm/label.h:41:19: warning: 'NSINDEX_SIGNATURE' defined but not used [-Wunused-const-variable=] Also, some places abuse "/**" which is only reserved for the kernel-doc. drivers/nvdimm/bus.c:648: warning: cannot understand function prototype: 'struct attribute_group nd_device_attribute_group = ' drivers/nvdimm/bus.c:677: warning: cannot understand function prototype: 'struct attribute_group nd_numa_attribute_group = ' Those are just some member assignments for the "struct attribute_group" instances and it can't be expressed in the kernel-doc. Reviewed-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Qian Cai <cai@lca.pw> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-05-31libnvdimm/namespace: Fix label tracking errorDan Williams
commit c4703ce11c23423d4b46e3d59aef7979814fd608 upstream. Users have reported intermittent occurrences of DIMM initialization failures due to duplicate allocations of address capacity detected in the labels, or errors of the form below, both have the same root cause. nd namespace1.4: failed to track label: 0 WARNING: CPU: 17 PID: 1381 at drivers/nvdimm/label.c:863 RIP: 0010:__pmem_label_update+0x56c/0x590 [libnvdimm] Call Trace: ? nd_pmem_namespace_label_update+0xd6/0x160 [libnvdimm] nd_pmem_namespace_label_update+0xd6/0x160 [libnvdimm] uuid_store+0x17e/0x190 [libnvdimm] kernfs_fop_write+0xf0/0x1a0 vfs_write+0xb7/0x1b0 ksys_write+0x57/0xd0 do_syscall_64+0x60/0x210 Unfortunately those reports were typically with a busy parallel namespace creation / destruction loop making it difficult to see the components of the bug. However, Jane provided a simple reproducer using the work-in-progress sub-section implementation. When ndctl is reconfiguring a namespace it may take an existing defunct / disabled namespace and reconfigure it with a new uuid and other parameters. Critically namespace_update_uuid() takes existing address resources and renames them for the new namespace to use / reconfigure as it sees fit. The bug is that this rename only happens in the resource tracking tree. Existing labels with the old uuid are not reaped leading to a scenario where multiple active labels reference the same span of address range. Teach namespace_update_uuid() to flag any references to the old uuid for reaping at the next label update attempt. Cc: <stable@vger.kernel.org> Fixes: bf9bccc14c05 ("libnvdimm: pmem label sets and namespace instantiation") Link: https://github.com/pmem/ndctl/issues/91 Reported-by: Jane Chu <jane.chu@oracle.com> Reported-by: Jeff Moyer <jmoyer@redhat.com> Reported-by: Erwin Tsaur <erwin.tsaur@oracle.com> Cc: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-31libnvdimm/pmem: Bypass CONFIG_HARDENED_USERCOPY overheadDan Williams
commit 52f476a323f9efc959be1c890d0cdcf12e1582e0 upstream. Jeff discovered that performance improves from ~375K iops to ~519K iops on a simple psync-write fio workload when moving the location of 'struct page' from the default PMEM location to DRAM. This result is surprising because the expectation is that 'struct page' for dax is only needed for third party references to dax mappings. For example, a dax-mapped buffer passed to another system call for direct-I/O requires 'struct page' for sending the request down the driver stack and pinning the page. There is no usage of 'struct page' for first party access to a file via read(2)/write(2) and friends. However, this "no page needed" expectation is violated by CONFIG_HARDENED_USERCOPY and the check_copy_size() performed in copy_from_iter_full_nocache() and copy_to_iter_mcsafe(). The check_heap_object() helper routine assumes the buffer is backed by a slab allocator (DRAM) page and applies some checks. Those checks are invalid, dax pages do not originate from the slab, and redundant, dax_iomap_actor() has already validated that the I/O is within bounds. Specifically that routine validates that the logical file offset is within bounds of the file, then it does a sector-to-pfn translation which validates that the physical mapping is within bounds of the block device. Bypass additional hardened usercopy overhead and call the 'no check' versions of the copy_{to,from}_iter operations directly. Fixes: 0aed55af8834 ("x86, uaccess: introduce copy_from_iter_flushcache...") Cc: <stable@vger.kernel.org> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Matthew Wilcox <willy@infradead.org> Reported-and-tested-by: Jeff Smits <jeff.smits@intel.com> Acked-by: Kees Cook <keescook@chromium.org> Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-16libnvdimm/btt: Fix a kmemdup failure checkAditya Pakki
[ Upstream commit 486fa92df4707b5df58d6508728bdb9321a59766 ] In case kmemdup fails, the fix releases resources and returns to avoid the NULL pointer dereference. Signed-off-by: Aditya Pakki <pakki001@umn.edu> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-05-16libnvdimm/namespace: Fix a potential NULL pointer dereferenceKangjie Lu
[ Upstream commit 55c1fc0af29a6c1b92f217b7eb7581a882e0c07c ] In case kmemdup fails, the fix goes to blk_err to avoid NULL pointer dereference. Signed-off-by: Kangjie Lu <kjlu@umn.edu> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-03-23libnvdimm: Fix altmap reservation size calculationOliver O'Halloran
commit 07464e88365e9236febaca9ed1a2e2006d8bc952 upstream. Libnvdimm reserves the first 8K of pfn and devicedax namespaces to store a superblock describing the namespace. This 8K reservation is contained within the altmap area which the kernel uses for the vmemmap backing for the pages within the namespace. The altmap allows for some pages at the start of the altmap area to be reserved and that mechanism is used to protect the superblock from being re-used as vmemmap backing. The number of PFNs to reserve is calculated using: PHYS_PFN(SZ_8K) Which is implemented as: #define PHYS_PFN(x) ((unsigned long)((x) >> PAGE_SHIFT)) So on systems where PAGE_SIZE is greater than 8K the reservation size is truncated to zero and the superblock area is re-used as vmemmap backing. As a result all the namespace information stored in the superblock (i.e. if it's a PFN or DAX namespace) is lost and the namespace needs to be re-created to get access to the contents. This patch fixes this by using PFN_UP() rather than PHYS_PFN() to ensure that at least one page is reserved. On systems with a 4K pages size this patch should have no effect. Cc: stable@vger.kernel.org Cc: Dan Williams <dan.j.williams@intel.com> Fixes: ac515c084be9 ("libnvdimm, pmem, pfn: move pfn setup to the core") Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Reviewed-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-03-23libnvdimm/pmem: Honor force_raw for legacy pmem regionsDan Williams
commit fa7d2e639cd90442d868dfc6ca1d4cc9d8bf206e upstream. For recovery, where non-dax access is needed to a given physical address range, and testing, allow the 'force_raw' attribute to override the default establishment of a dev_pagemap. Otherwise without this capability it is possible to end up with a namespace that can not be activated due to corrupted info-block, and one that can not be repaired due to a section collision. Cc: <stable@vger.kernel.org> Fixes: 004f1afbe199 ("libnvdimm, pmem: direct map legacy pmem by default") Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-03-23libnvdimm, pfn: Fix over-trim in trim_pfn_device()Wei Yang
commit f101ada7da6551127d192c2f1742c1e9e0f62799 upstream. When trying to see whether current nd_region intersects with others, trim_pfn_device() has already calculated the *size* to be expanded to SECTION size. Do not double append 'adjust' to 'size' when calculating whether the end of a region collides with the next pmem region. Fixes: ae86cbfef381 "libnvdimm, pfn: Pad pfn namespaces relative to other regions" Cc: <stable@vger.kernel.org> Signed-off-by: Wei Yang <richardw.yang@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-03-23libnvdimm/label: Clear 'updating' flag after label-set updateDan Williams
commit 966d23a006ca7b44ac8cf4d0c96b19785e0c3da0 upstream. The UEFI 2.7 specification sets expectations that the 'updating' flag is eventually cleared. To date, the libnvdimm core has never adhered to that protocol. The policy of the core matches the policy of other multi-device info-block formats like MD-Software-RAID that expect administrator intervention on inconsistent info-blocks, not automatic invalidation. However, some pre-boot environments may unfortunately attempt to "clean up" the labels and invalidate a set when it fails to find at least one "non-updating" label in the set. Clear the updating flag after set updates to minimize the window of vulnerability to aggressive pre-boot environments. Ideally implementations would not write to the label area outside of creating namespaces. Note that this only minimizes the window, it does not close it as the system can still crash while clearing the flag and the set can be subsequently deleted / invalidated by the pre-boot environment. Fixes: f524bf271a5c ("libnvdimm: write pmem label set") Cc: <stable@vger.kernel.org> Cc: Kelly Couch <kelly.j.couch@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-12-13libnvdimm, pfn: Pad pfn namespaces relative to other regionsDan Williams
commit ae86cbfef3818300f1972e52f67a93211acb0e24 upstream. Commit cfe30b872058 "libnvdimm, pmem: adjust for section collisions with 'System RAM'" enabled Linux to workaround occasions where platform firmware arranges for "System RAM" and "Persistent Memory" to collide within a single section boundary. Unfortunately, as reported in this issue [1], platform firmware can inflict the same collision between persistent memory regions. The approach of interrogating iomem_resource does not work in this case because platform firmware may merge multiple regions into a single iomem_resource range. Instead provide a method to interrogate regions that share the same parent bus. This is a stop-gap until the core-MM can grow support for hotplug on sub-section boundaries. [1]: https://github.com/pmem/ndctl/issues/76 Fixes: cfe30b872058 ("libnvdimm, pmem: adjust for section collisions with...") Cc: <stable@vger.kernel.org> Reported-by: Patrick Geary <patrickg@supermicro.com> Tested-by: Patrick Geary <patrickg@supermicro.com> Reviewed-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-11-13libnvdimm, region: Fail badblocks listing for inactive regionsDan Williams
commit 5d394eee2c102453278d81d9a7cf94c80253486a upstream. While experimenting with region driver loading the following backtrace was triggered: INFO: trying to register non-static key. the code is fine but needs lockdep annotation. turning off the locking correctness validator. [..] Call Trace: dump_stack+0x85/0xcb register_lock_class+0x571/0x580 ? __lock_acquire+0x2ba/0x1310 ? kernfs_seq_start+0x2a/0x80 __lock_acquire+0xd4/0x1310 ? dev_attr_show+0x1c/0x50 ? __lock_acquire+0x2ba/0x1310 ? kernfs_seq_start+0x2a/0x80 ? lock_acquire+0x9e/0x1a0 lock_acquire+0x9e/0x1a0 ? dev_attr_show+0x1c/0x50 badblocks_show+0x70/0x190 ? dev_attr_show+0x1c/0x50 dev_attr_show+0x1c/0x50 This results from a missing successful call to devm_init_badblocks() from nd_region_probe(). Block attempts to show badblocks while the region is not enabled. Fixes: 6a6bef90425e ("libnvdimm: add mechanism to publish badblocks...") Cc: <stable@vger.kernel.org> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Dave Jiang <dave.jiang@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-11-13libnvdimm: Hold reference on parent while scheduling async initAlexander Duyck
commit b6eae0f61db27748606cc00dafcfd1e2c032f0a5 upstream. Unlike asynchronous initialization in the core we have not yet associated the device with the parent, and as such the device doesn't hold a reference to the parent. In order to resolve that we should be holding a reference on the parent until the asynchronous initialization has completed. Cc: <stable@vger.kernel.org> Fixes: 4d88a97aa9e8 ("libnvdimm: ...base ... infrastructure") Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-09-09libnvdimm: fix ars_status output length calculationVishal Verma
commit 286e87718103acdf85f4ed323a37e4839a8a7c05 upstream. Commit efda1b5d87cb ("acpi, nfit, libnvdimm: fix / harden ars_status output length handling") Introduced additional hardening for ambiguity in the ACPI spec for ars_status output sizing. However, it had a couple of cases mixed up. Where it should have been checking for (and returning) "out_field[1] - 4" it was using "out_field[1] - 8" and vice versa. This caused a four byte discrepancy in the buffer size passed on to the command handler, and in some cases, this caused memory corruption like: ./daxdev-errors.sh: line 76: 24104 Aborted (core dumped) ./daxdev-errors $busdev $region malloc(): memory corruption Program received signal SIGABRT, Aborted. [...] #5 0x00007ffff7865a2e in calloc () from /lib64/libc.so.6 #6 0x00007ffff7bc2970 in ndctl_bus_cmd_new_ars_status (ars_cap=ars_cap@entry=0x6153b0) at ars.c:136 #7 0x0000000000401644 in check_ars_status (check=0x7fffffffdeb0, bus=0x604c20) at daxdev-errors.c:144 #8 test_daxdev_clear_error (region_name=<optimized out>, bus_name=<optimized out>) at daxdev-errors.c:332 Cc: <stable@vger.kernel.org> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Keith Busch <keith.busch@intel.com> Cc: Lukasz Dorau <lukasz.dorau@intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Fixes: efda1b5d87cb ("acpi, nfit, libnvdimm: fix / harden ars_status output length handling") Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-of-by: Dave Jiang <dave.jiang@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-07-03linvdimm, pmem: Preserve read-only setting for pmem devicesRobert Elliott
commit 254a4cd50b9fe2291a12b8902e08e56dcc4e9b10 upstream. The pmem driver does not honor a forced read-only setting for very long: $ blockdev --setro /dev/pmem0 $ blockdev --getro /dev/pmem0 1 followed by various commands like these: $ blockdev --rereadpt /dev/pmem0 or $ mkfs.ext4 /dev/pmem0 results in this in the kernel serial log: nd_pmem namespace0.0: region0 read-write, marking pmem0 read-write with the read-only setting lost: $ blockdev --getro /dev/pmem0 0 That's from bus.c nvdimm_revalidate_disk(), which always applies the setting from nd_region (which is initially based on the ACPI NFIT NVDIMM state flags not_armed bit). In contrast, commit 20bd1d026aac ("scsi: sd: Keep disk read-only when re-reading partition") fixed this issue for SCSI devices to preserve the previous setting if it was set to read-only. This patch modifies bus.c to preserve any previous read-only setting. It also eliminates the kernel serial log print except for cases where read-write is changed to read-only, so it doesn't print read-only to read-only non-changes. Cc: <stable@vger.kernel.org> Fixes: 581388209405 ("libnvdimm, nfit: handle unarmed dimms, mark namespaces read-only") Signed-off-by: Robert Elliott <elliott@hpe.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-04-24libnvdimm, namespace: use a safe lookup for dimm device nameDan Williams
commit 4f8672201b7e7ed4f5f6c3cf6dcd080648580582 upstream. The following NULL dereference results from incorrectly assuming that ndd is valid in this print: struct nvdimm_drvdata *ndd = to_ndd(&nd_region->mapping[i]); /* * Give up if we don't find an instance of a uuid at each * position (from 0 to nd_region->ndr_mappings - 1), or if we * find a dimm with two instances of the same uuid. */ dev_err(&nd_region->dev, "%s missing label for %pUb\n", dev_name(ndd->dev), nd_label->uuid); BUG: unable to handle kernel NULL pointer dereference at 0000000000000000 IP: nd_region_register_namespaces+0xd67/0x13c0 [libnvdimm] PGD 0 P4D 0 Oops: 0000 [#1] SMP PTI CPU: 43 PID: 673 Comm: kworker/u609:10 Not tainted 4.16.0-rc4+ #1 [..] RIP: 0010:nd_region_register_namespaces+0xd67/0x13c0 [libnvdimm] [..] Call Trace: ? devres_add+0x2f/0x40 ? devm_kmalloc+0x52/0x60 ? nd_region_activate+0x9c/0x320 [libnvdimm] nd_region_probe+0x94/0x260 [libnvdimm] ? kernfs_add_one+0xe4/0x130 nvdimm_bus_probe+0x63/0x100 [libnvdimm] Switch to using the nvdimm device directly. Fixes: 0e3b0d123c8f ("libnvdimm, namespace: allow multiple pmem...") Cc: <stable@vger.kernel.org> Reported-by: Dave Jiang <dave.jiang@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-04-24libnvdimm, dimm: fix dpa reservation vs uninitialized label areaDan Williams
commit c31898c8c711f2bbbcaebe802a55827e288d875a upstream. At initialization time the 'dimm' driver caches a copy of the memory device's label area and reserves address space for each of the namespaces defined. However, as can be seen below, the reservation occurs even when the index blocks are invalid: nvdimm nmem0: nvdimm_init_config_data: len: 131072 rc: 0 nvdimm nmem0: config data size: 131072 nvdimm nmem0: __nd_label_validate: nsindex0 labelsize 1 invalid nvdimm nmem0: __nd_label_validate: nsindex1 labelsize 1 invalid nvdimm nmem0: : pmem-6025e505: 0x1000000000 @ 0xf50000000 reserve <-- bad Gate dpa reservation on the presence of valid index blocks. Cc: <stable@vger.kernel.org> Fixes: 4a826c83db4e ("libnvdimm: namespace indices: read and validate") Reported-by: Krzysztof Rusocki <krzysztof.rusocki@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-28libnvdimm, {btt, blk}: do integrity setup before add_disk()Vishal Verma
commit 3ffb0ba9b567a8efb9a04ed3d1ec15ff333ada22 upstream. Prior to 25520d55cdb6 ("block: Inline blk_integrity in struct gendisk") we needed to temporarily add a zero-capacity disk before registering for blk-integrity. But adding a zero-capacity disk caused the partition table scanning to bail early, and this resulted in partitions not coming up after a probe of the BTT or blk namespaces. We can now register for integrity before the disk has been added, and this fixes the rescan problems. Fixes: 25520d55cdb6 ("block: Inline blk_integrity in struct gendisk") Reported-by: Dariusz Dokupil <dariusz.dokupil@intel.com> Cc: <stable@vger.kernel.org> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-29libnvdimm, pfn: fix start_pad handling for aligned namespacesDan Williams
commit 19deaa217bc04e83b59b5e8c8229eb0e53ad9efc upstream. The alignment checks at pfn driver startup fail to properly account for the 'start_pad' in the case where the namespace is misaligned relative to its internal alignment. This is typically triggered in 1G aligned namespace, but could theoretically trigger with small namespace alignments. When this triggers the kernel reports messages of the form: dax2.1: bad offset: 0x3c000000 dax disabled align: 0x40000000 Fixes: 1ee6667cd8d1 ("libnvdimm, pfn, dax: fix initialization vs autodetect...") Reported-by: Jane Chu <jane.chu@oracle.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-29libnvdimm, btt: Fix an incompatibility in the log layoutVishal Verma
commit 24e3a7fb60a9187e5df90e5fa655ffc94b9c4f77 upstream. Due to a spec misinterpretation, the Linux implementation of the BTT log area had different padding scheme from other implementations, such as UEFI and NVML. This fixes the padding scheme, and defaults to it for new BTT layouts. We attempt to detect the padding scheme in use when probing for an existing BTT. If we detect the older/incompatible scheme, we continue using it. Reported-by: Juston Li <juston.li@intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Fixes: 5212e11fde4d ("nd_btt: atomic sector updates") Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-29libnvdimm, dax: fix 1GB-aligned namespaces vs physical misalignmentDan Williams
commit 41fce90f26333c4fa82e8e43b9ace86c4e8a0120 upstream. The following namespace configuration attempt: # ndctl create-namespace -e namespace0.0 -m devdax -a 1G -f libndctl: ndctl_dax_enable: dax0.1: failed to enable Error: namespace0.0: failed to enable failed to reconfigure namespace: No such device or address ...fails when the backing memory range is not physically aligned to 1G: # cat /proc/iomem | grep Persistent 210000000-30fffffff : Persistent Memory (legacy) In the above example the 4G persistent memory range starts and ends on a 256MB boundary. We handle this case correctly when needing to handle cases that violate section alignment (128MB) collisions against "System RAM", and we simply need to extend that padding/truncation for the 1GB alignment use case. Fixes: 315c562536c4 ("libnvdimm, pfn: add 'align' attribute...") Reported-and-tested-by: Jane Chu <jane.chu@oracle.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-30libnvdimm, namespace: make 'resource' attribute only readable by rootDan Williams
commit c1fb3542074fd0c4d901d778bd52455111e4eb6f upstream. For the same reason that /proc/iomem returns 0's for non-root readers and acpi tables are root-only, make the 'resource' attribute for namespace devices only readable by root. Otherwise we disclose physical address information. Fixes: bf9bccc14c05 ("libnvdimm: pmem label sets and namespace instantiation") Reported-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-30libnvdimm, region : make 'resource' attribute only readable by rootDan Williams
commit b8ff981f88df03c72a4de2f6eaa9ce447a10ac03 upstream. For the same reason that /proc/iomem returns 0's for non-root readers and acpi tables are root-only, make the 'resource' attribute for region devices only readable by root. Otherwise we disclose physical address information. Fixes: 802f4be6feee ("libnvdimm: Add 'resource' sysfs attribute to regions") Cc: Dave Jiang <dave.jiang@intel.com> Cc: Johannes Thumshirn <jthumshirn@suse.de> Reported-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-30libnvdimm, namespace: fix label initialization to use valid seq numbersDan Williams
commit b18d4b8a25af6fe83d7692191d6ff962ea611c4f upstream. The set of valid sequence numbers is {1,2,3}. The specification indicates that an implementation should consider 0 a sign of a critical error: UEFI 2.7: 13.19 NVDIMM Label Protocol Software never writes the sequence number 00, so a correctly check-summed Index Block with this sequence number probably indicates a critical error. When software discovers this case it treats it as an invalid Index Block indication. While the expectation is that the invalid block is just thrown away, the Robustness Principle says we should fix this to make both sequence numbers valid. Fixes: f524bf271a5c ("libnvdimm: write pmem label set") Reported-by: Juston Li <juston.li@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-30libnvdimm, pfn: make 'resource' attribute only readable by rootDan Williams
commit 26417ae4fc6108f8db436f24108b08f68bdc520e upstream. For the same reason that /proc/iomem returns 0's for non-root readers and acpi tables are root-only, make the 'resource' attribute for pfn devices only readable by root. Otherwise we disclose physical address information. Fixes: f6ed58c70d14 ("libnvdimm, pfn: 'resource'-address and 'size'...") Reported-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-30libnvdimm, dimm: clear 'locked' status on successful DIMM enableDan Williams
commit d34cb808402898e53b9a9bcbbedd01667a78723b upstream. If we successfully enable a DIMM then it must not be locked and we can clear the label-read failure condition. Otherwise, we need to reload the entire bus provider driver to achieve the same effect, and that can disrupt unrelated DIMMs and namespaces. Fixes: 9d62ed965118 ("libnvdimm: handle locked label storage areas") Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-02License cleanup: add SPDX GPL-2.0 license identifier to files with no licenseGreg Kroah-Hartman
Many source files in the tree are missing licensing information, which makes it harder for compliance tools to determine the correct license. By default all files without license information are under the default license of the kernel, which is GPL version 2. Update the files which contain no license information with the 'GPL-2.0' SPDX license identifier. The SPDX identifier is a legally binding shorthand, which can be used instead of the full boiler plate text. This patch is based on work done by Thomas Gleixner and Kate Stewart and Philippe Ombredanne. How this work was done: Patches were generated and checked against linux-4.14-rc6 for a subset of the use cases: - file had no licensing information it it. - file was a */uapi/* one with no licensing information in it, - file was a */uapi/* one with existing licensing information, Further patches will be generated in subsequent months to fix up cases where non-standard license headers were used, and references to license had to be inferred by heuristics based on keywords. The analysis to determine which SPDX License Identifier to be applied to a file was done in a spreadsheet of side by side results from of the output of two independent scanners (ScanCode & Windriver) producing SPDX tag:value files created by Philippe Ombredanne. Philippe prepared the base worksheet, and did an initial spot review of a few 1000 files. The 4.13 kernel was the starting point of the analysis with 60,537 files assessed. Kate Stewart did a file by file comparison of the scanner results in the spreadsheet to determine which SPDX license identifier(s) to be applied to the file. She confirmed any determination that was not immediately clear with lawyers working with the Linux Foundation. Criteria used to select files for SPDX license identifier tagging was: - Files considered eligible had to be source code files. - Make and config files were included as candidates if they contained >5 lines of source - File already had some variant of a license header in it (even if <5 lines). All documentation files were explicitly excluded. The following heuristics were used to determine which SPDX license identifiers to apply. - when both scanners couldn't find any license traces, file was considered to have no license information in it, and the top level COPYING file license applied. For non */uapi/* files that summary was: SPDX license identifier # files ---------------------------------------------------|------- GPL-2.0 11139 and resulted in the first patch in this series. If that file was a */uapi/* path one, it was "GPL-2.0 WITH Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was: SPDX license identifier # files ---------------------------------------------------|------- GPL-2.0 WITH Linux-syscall-note 930 and resulted in the second patch in this series. - if a file had some form of licensing information in it, and was one of the */uapi/* ones, it was denoted with the Linux-syscall-note if any GPL family license was found in the file or had no licensing in it (per prior point). Results summary: SPDX license identifier # files ---------------------------------------------------|------ GPL-2.0 WITH Linux-syscall-note 270 GPL-2.0+ WITH Linux-syscall-note 169 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17 LGPL-2.1+ WITH Linux-syscall-note 15 GPL-1.0+ WITH Linux-syscall-note 14 ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5 LGPL-2.0+ WITH Linux-syscall-note 4 LGPL-2.1 WITH Linux-syscall-note 3 ((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3 ((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1 and that resulted in the third patch in this series. - when the two scanners agreed on the detected license(s), that became the concluded license(s). - when there was disagreement between the two scanners (one detected a license but the other didn't, or they both detected different licenses) a manual inspection of the file occurred. - In most cases a manual inspection of the information in the file resulted in a clear resolution of the license that should apply (and which scanner probably needed to revisit its heuristics). - When it was not immediately clear, the license identifier was confirmed with lawyers working with the Linux Foundation. - If there was any question as to the appropriate license identifier, the file was flagged for further research and to be revisited later in time. In total, over 70 hours of logged manual review was done on the spreadsheet to determine the SPDX license identifiers to apply to the source files by Kate, Philippe, Thomas and, in some cases, confirmation by lawyers working with the Linux Foundation. Kate also obtained a third independent scan of the 4.13 code base from FOSSology, and compared selected files where the other two scanners disagreed against that SPDX file, to see if there was new insights. The Windriver scanner is based on an older version of FOSSology in part, so they are related. Thomas did random spot checks in about 500 files from the spreadsheets for the uapi headers and agreed with SPDX license identifier in the files he inspected. For the non-uapi files Thomas did random spot checks in about 15000 files. In initial set of patches against 4.14-rc6, 3 files were found to have copy/paste license identifier errors, and have been fixed to reflect the correct identifier. Additionally Philippe spent 10 hours this week doing a detailed manual inspection and review of the 12,461 patched files from the initial patch version early this week with: - a full scancode scan run, collecting the matched texts, detected license ids and scores - reviewing anything where there was a license detected (about 500+ files) to ensure that the applied SPDX license was correct - reviewing anything where there was no detection but the patch license was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied SPDX license was correct This produced a worksheet with 20 files needing minor correction. This worksheet was then exported into 3 different .csv files for the different types of files to be modified. These .csv files were then reviewed by Greg. Thomas wrote a script to parse the csv files and add the proper SPDX tag to the file, in the format that the file expected. This script was further refined by Greg based on the output to detect more types of files automatically and to distinguish between header and source .c files (which need different comment types.) Finally Greg ran the script using the .csv files to generate the patches. Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org> Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-09-18libnvdimm, namespace: fix btt claim class crashDan Williams
Maurice reports: BUG: unable to handle kernel NULL pointer dereference at 0000000000000028 IP: holder_class_store+0x253/0x2b0 [libnvdimm] ...while trying to reconfigure an NVDIMM-N namespace into 'sector' / 'btt' mode. The crash points to this line: (gdb) li *(holder_class_store+0x253) 0x7773 is in holder_class_store (drivers/nvdimm/namespace_devs.c:1420). 1415 for (i = 0; i < nd_region->ndr_mappings; i++) { 1416 struct nd_mapping *nd_mapping = &nd_region->mapping[i]; 1417 struct nvdimm_drvdata *ndd = to_ndd(nd_mapping); 1418 struct nd_namespace_index *nsindex; 1419 1420 nsindex = to_namespace_index(ndd, ndd->ns_current); ...where we are failing because ndd is NULL due to NVDIMM-N dimms not supporting labels. Long story short, default to the BTTv1 format in the label-less / NVDIMM-N case. Fixes: 14e494542636 ("libnvdimm, btt: BTT updates for UEFI 2.7 format") Cc: <stable@vger.kernel.org> Cc: Vishal Verma <vishal.l.verma@intel.com> Reported-by: Maurice A. Saldivar <maurice.a.saldivar@hpe.com> Tested-by: Maurice A. Saldivar <maurice.a.saldivar@hpe.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-09-14Merge tag 'for-4.14/dm-changes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper updates from Mike Snitzer: - Some request-based DM core and DM multipath fixes and cleanups - Constify a few variables in DM core and DM integrity - Add bufio optimization and checksum failure accounting to DM integrity - Fix DM integrity to avoid checking integrity of failed reads - Fix DM integrity to use init_completion - A couple DM log-writes target fixes - Simplify DAX flushing by eliminating the unnecessary flush abstraction that was stood up for DM's use. * tag 'for-4.14/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dax: remove the pmem_dax_ops->flush abstraction dm integrity: use init_completion instead of COMPLETION_INITIALIZER_ONSTACK dm integrity: make blk_integrity_profile structure const dm integrity: do not check integrity for failed read operations dm log writes: fix >512b sectorsize support dm log writes: don't use all the cpu while waiting to log blocks dm ioctl: constify ioctl lookup table dm: constify argument arrays dm integrity: count and display checksum failures dm integrity: optimize writing dm-bufio buffers that are partially changed dm rq: do not update rq partially in each ending bio dm rq: make dm-sq requeuing behavior consistent with dm-mq behavior dm mpath: complain about unsupported __multipath_map_bio() return values dm mpath: avoid that building with W=1 causes gcc 7 to complain about fall-through
2017-09-11Merge tag 'libnvdimm-for-4.14' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm Pull libnvdimm from Dan Williams: "A rework of media error handling in the BTT driver and other updates. It has appeared in a few -next releases and collected some late- breaking build-error and warning fixups as a result. Summary: - Media error handling support in the Block Translation Table (BTT) driver is reworked to address sleeping-while-atomic locking and memory-allocation-context conflicts. - The dax_device lookup overhead for xfs and ext4 is moved out of the iomap hot-path to a mount-time lookup. - A new 'ecc_unit_size' sysfs attribute is added to advertise the read-modify-write boundary property of a persistent memory range. - Preparatory fix-ups for arm and powerpc pmem support are included along with other miscellaneous fixes" * tag 'libnvdimm-for-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (26 commits) libnvdimm, btt: fix format string warnings libnvdimm, btt: clean up warning and error messages ext4: fix null pointer dereference on sbi libnvdimm, nfit: move the check on nd_reserved2 to the endpoint dax: fix FS_DAX=n BLOCK=y compilation libnvdimm: fix integer overflow static analysis warning libnvdimm, nd_blk: remove mmio_flush_range() libnvdimm, btt: rework error clearing libnvdimm: fix potential deadlock while clearing errors libnvdimm, btt: cache sector_size in arena_info libnvdimm, btt: ensure that flags were also unchanged during a map_read libnvdimm, btt: refactor map entry operations with macros libnvdimm, btt: fix a missed NVDIMM_IO_ATOMIC case in the write path libnvdimm, nfit: export an 'ecc_unit_size' sysfs attribute ext4: perform dax_device lookup at mount ext2: perform dax_device lookup at mount xfs: perform dax_device lookup at mount dax: introduce a fs_dax_get_by_bdev() helper libnvdimm, btt: check memory allocation failure libnvdimm, label: fix index block size calculation ...
2017-09-11dax: remove the pmem_dax_ops->flush abstractionMikulas Patocka
Commit abebfbe2f731 ("dm: add ->flush() dax operation support") is buggy. A DM device may be composed of multiple underlying devices and all of them need to be flushed. That commit just routes the flush request to the first device and ignores the other devices. It could be fixed by adding more complex logic to the device mapper. But there is only one implementation of the method pmem_dax_ops->flush - that is pmem_dax_flush() - and it calls arch_wb_cache_pmem(). Consequently, we don't need the pmem_dax_ops->flush abstraction at all, we can call arch_wb_cache_pmem() directly from dax_flush() because dax_dev->ops->flush can't ever reach anything different from arch_wb_cache_pmem(). It should be also pointed out that for some uses of persistent memory it is needed to flush only a very small amount of data (such as 1 cacheline), and it would be overkill if we go through that device mapper machinery for a single flushed cache line. Fix this by removing the pmem_dax_ops->flush abstraction and call arch_wb_cache_pmem() directly from dax_flush(). Also, remove the device mapper code that forwards the flushes. Fixes: abebfbe2f731 ("dm: add ->flush() dax operation support") Cc: stable@vger.kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-09-09libnvdimm, btt: fix format string warningsRandy Dunlap
Fix format warnings (seen on i386) in nvdimm/btt.c: ../drivers/nvdimm/btt.c: In function ‘btt_map_init’: ../drivers/nvdimm/btt.c:430:3: warning: format ‘%lx’ expects argument of type ‘long unsigned int’, but argument 4 has type ‘size_t’ [-Wformat=] dev_WARN_ONCE(to_dev(arena), size < 512, ^ ../drivers/nvdimm/btt.c: In function ‘btt_log_init’: ../drivers/nvdimm/btt.c:474:3: warning: format ‘%lx’ expects argument of type ‘long unsigned int’, but argument 4 has type ‘size_t’ [-Wformat=] dev_WARN_ONCE(to_dev(arena), size < 512, ^ Fixes: 86652d2eb347 ("libnvdimm, btt: clean up warning and error messages") Reported-by: Arnd Bergmann <arnd@arndb.de> Reported-by: kbuild test robot <fengguang.wu@intel.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-09-07libnvdimm, btt: clean up warning and error messagesVishal Verma
Convert all WARN* style messages to dev_WARN, and for errors in the IO paths, use dev_err_ratelimited. Also remove some BUG_ONs in the IO path and replace them with the above - no need to crash the machine in case of an unaligned IO. Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-09-07Merge branch 'for-4.14/block' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block layer updates from Jens Axboe: "This is the first pull request for 4.14, containing most of the code changes. It's a quiet series this round, which I think we needed after the churn of the last few series. This contains: - Fix for a registration race in loop, from Anton Volkov. - Overflow complaint fix from Arnd for DAC960. - Series of drbd changes from the usual suspects. - Conversion of the stec/skd driver to blk-mq. From Bart. - A few BFQ improvements/fixes from Paolo. - CFQ improvement from Ritesh, allowing idling for group idle. - A few fixes found by Dan's smatch, courtesy of Dan. - A warning fixup for a race between changing the IO scheduler and device remova. From David Jeffery. - A few nbd fixes from Josef. - Support for cgroup info in blktrace, from Shaohua. - Also from Shaohua, new features in the null_blk driver to allow it to actually hold data, among other things. - Various corner cases and error handling fixes from Weiping Zhang. - Improvements to the IO stats tracking for blk-mq from me. Can drastically improve performance for fast devices and/or big machines. - Series from Christoph removing bi_bdev as being needed for IO submission, in preparation for nvme multipathing code. - Series from Bart, including various cleanups and fixes for switch fall through case complaints" * 'for-4.14/block' of git://git.kernel.dk/linux-block: (162 commits) kernfs: checking for IS_ERR() instead of NULL drbd: remove BIOSET_NEED_RESCUER flag from drbd_{md_,}io_bio_set drbd: Fix allyesconfig build, fix recent commit drbd: switch from kmalloc() to kmalloc_array() drbd: abort drbd_start_resync if there is no connection drbd: move global variables to drbd namespace and make some static drbd: rename "usermode_helper" to "drbd_usermode_helper" drbd: fix race between handshake and admin disconnect/down drbd: fix potential deadlock when trying to detach during handshake drbd: A single dot should be put into a sequence. drbd: fix rmmod cleanup, remove _all_ debugfs entries drbd: Use setup_timer() instead of init_timer() to simplify the code. drbd: fix potential get_ldev/put_ldev refcount imbalance during attach drbd: new disk-option disable-write-same drbd: Fix resource role for newly created resources in events2 drbd: mark symbols static where possible drbd: Send P_NEG_ACK upon write error in protocol != C drbd: add explicit plugging when submitting batches drbd: change list_for_each_safe to while(list_first_entry_or_null) drbd: introduce drbd_recv_header_maybe_unplug ...
2017-09-06block, THP: make block_device_operations.rw_page support THPHuang Ying
The .rw_page in struct block_device_operations is used by the swap subsystem to read/write the page contents from/into the corresponding swap slot in the swap device. To support the THP (Transparent Huge Page) swap optimization, the .rw_page is enhanced to support to read/write THP if possible. Link: http://lkml.kernel.org/r/20170724051840.2309-6-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Ross Zwisler <ross.zwisler@intel.com> [for brd.c, zram_drv.c, pmem.c] Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Vishal L Verma <vishal.l.verma@intel.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Rik van Riel <riel@redhat.com> Cc: Shaohua Li <shli@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-04libnvdimm, nfit: move the check on nd_reserved2 to the endpointMeng Xu
Delay the check of nd_reserved2 to the actual endpoint (acpi_nfit_ctl) that uses it, as a prevention of a potential double-fetch bug. While examining the kernel source code, I found a dangerous operation that could turn into a double-fetch situation (a race condition bug) where the same userspace memory region are fetched twice into kernel with sanity checks after the first fetch while missing checks after the second fetch. In the case of _IOC_NR(ioctl_cmd) == ND_CMD_CALL: 1. The first fetch happens in line 935 copy_from_user(&pkg, p, sizeof(pkg) 2. subsequently `pkg.nd_reserved2` is asserted to be all zeroes (line 984 to 986). 3. The second fetch happens in line 1022 copy_from_user(buf, p, buf_len) 4. Given that `p` can be fully controlled in userspace, an attacker can race condition to override the header part of `p`, say, `((struct nd_cmd_pkg *)p)->nd_reserved2` to arbitrary value (say nine 0xFFFFFFFF for `nd_reserved2`) after the first fetch but before the second fetch. The changed value will be copied to `buf`. 5. There is no checks on the second fetches until the use of it in line 1034: nd_cmd_clear_to_send(nvdimm_bus, nvdimm, cmd, buf) and line 1038: nd_desc->ndctl(nd_desc, nvdimm, cmd, buf, buf_len, &cmd_rc) which means that the assumed relation, `p->nd_reserved2` are all zeroes might not hold after the second fetch. And once the control goes to these functions we lose the context to assert the assumed relation. 6. Based on my manual analysis, `p->nd_reserved2` is not used in function `nd_cmd_clear_to_send` and potential implementations of `nd_desc->ndctl` so there is no working exploit against it right now. However, this could easily turns to an exploitable one if careless developers start to use `p->nd_reserved2` later and assume that they are all zeroes. Move the validation of the nd_reserved2 field to the ->ndctl() implementation where it has a stable buffer to evaluate. Signed-off-by: Meng Xu <mengxu.gatech@gmail.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-08-31libnvdimm: fix integer overflow static analysis warningDan Williams
Dan reports: The patch 62232e45f4a2: "libnvdimm: control (ioctl) messages for nvdimm_bus and nvdimm devices" from Jun 8, 2015, leads to the following static checker warning: drivers/nvdimm/bus.c:1018 __nd_ioctl() warn: integer overflows 'buf_len' From a casual review, this seems like it might be a real bug. On the first iteration we load some data into in_env[]. On the second iteration we read a use controlled "in_size" from nd_cmd_in_size(). It can go up to UINT_MAX - 1. A high number means we will fill the whole in_env[] buffer. But we potentially keep looping and adding more to in_len so now it can be any value. It simple enough to change, but it feels weird that we keep looping even though in_env is totally full. Shouldn't we just return an error if we don't have space for desc->in_num. We keep looping because the size of the total input is allowed to be bigger than the 'envelope' which is a subset of the payload that tells us how much data to expect. For safety explicitly check that buf_len does not overflow which is what the checker flagged. Cc: <stable@vger.kernel.org> Fixes: 62232e45f4a2: "libnvdimm: control (ioctl) messages for nvdimm_bus..." Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-08-31libnvdimm, nd_blk: remove mmio_flush_range()Robin Murphy
mmio_flush_range() suffers from a lack of clearly-defined semantics, and is somewhat ambiguous to port to other architectures where the scope of the writeback implied by "flush" and ordering might matter, but MMIO would tend to imply non-cacheable anyway. Per the rationale in 67a3e8fe9015 ("nd_blk: change aperture mapping from WC to WB"), the only existing use is actually to invalidate clean cache lines for ARCH_MEMREMAP_PMEM type mappings *without* writeback. Since the recent cleanup of the pmem API, that also now happens to be the exact purpose of arch_invalidate_pmem(), which would be a far more well-defined tool for the job. Rather than risk potentially inconsistent implementations of mmio_flush_range() for the sake of one callsite, streamline things by removing it entirely and instead move the ARCH_MEMREMAP_PMEM related definitions up to the libnvdimm level, so they can be shared by NFIT as well. This allows NFIT to be enabled for arm64. Signed-off-by: Robin Murphy <robin.murphy@arm.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-08-31libnvdimm, btt: rework error clearingVishal Verma
Clearing errors or badblocks during a BTT write requires sending an ACPI DSM, which means potentially sleeping. Since a BTT IO happens in atomic context (preemption disabled, spinlocks may be held), we cannot perform error clearing in the course of an IO. Due to this error clearing for BTT IOs has hitherto been disabled. In this patch we move error clearing out of the atomic section, and thus re-enable error clearing with BTTs. When we are about to add a block to the free list, we check if it was previously marked as an error, and if it was, we add it to the freelist, but also set a flag that says error clearing will be required. We then drop the lane (ending the atomic context), and send a zero buffer so that the error can be cleared. The error flag in the free list is protected by the nd 'lane', and is set only be a thread while it holds that lane. When the error is cleared, the flag is cleared, but while holding a mutex for that freelist index. When writing, we check for two things - 1/ If the freelist mutex is held or if the error flag is set. If so, this is an error block that is being (or about to be) cleared. 2/ If the block is a known badblock based on nsio->bb The second check is required because the BTT map error flag for a map entry only gets set when an error LBA is read. If we write to a new location that may not have the map error flag set, but still might be in the region's badblock list, we can trigger an EIO on the write, which is undesirable and completely avoidable. Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Toshi Kani <toshi.kani@hpe.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-08-31libnvdimm: fix potential deadlock while clearing errorsVishal Verma
With the ACPI NFIT 'DSM' methods, acpi can be called from IO paths. Specifically, the DSM to clear media errors is called during writes, so that we can provide a writes-fix-errors model. However it is easy to imagine a scenario like: -> write through the nvdimm driver -> acpi allocation -> writeback, causes more IO through the nvdimm driver -> deadlock Fix this by using memalloc_noio_{save,restore}, which sets the GFP_NOIO flag for the current scope when issuing commands/IOs that are expected to clear errors. Cc: <linux-acpi@vger.kernel.org> Cc: <linux-nvdimm@lists.01.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Robert Moore <robert.moore@intel.com> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-08-31libnvdimm, btt: cache sector_size in arena_infoVishal Verma
In preparation for the error clearing rework, add sector_size in the arena_info struct. Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-08-31libnvdimm, btt: ensure that flags were also unchanged during a map_readVishal Verma
In btt_map_read, we read the map twice to make sure that the map entry didn't change after we added it to the read tracking table. In anticipation of expanding the use of the error bit, also make sure that the error and zero flags are constant across the two map reads. Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>