summaryrefslogtreecommitdiffstats
AgeCommit message (Collapse)Author
2017-06-02Linux 4.8.24v4.8.24Paul Gortmaker
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2017-06-02nvme/pci: Disable on removal when disconnectedKeith Busch
commit 6db28eda266052f86a6b402422de61eeb7d2e351 upstream. If the device is not present, the driver should disable the queues immediately. Prior to this, the driver was relying on the watchdog timer to kill the queues if requests were outstanding to the device, and that just delays removal up to one second. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@fb.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2017-06-02padata: avoid race in reorderingJason A. Donenfeld
commit de5540d088fe97ad583cc7d396586437b32149a5 upstream. Under extremely heavy uses of padata, crashes occur, and with list debugging turned on, this happens instead: [87487.298728] WARNING: CPU: 1 PID: 882 at lib/list_debug.c:33 __list_add+0xae/0x130 [87487.301868] list_add corruption. prev->next should be next (ffffb17abfc043d0), but was ffff8dba70872c80. (prev=ffff8dba70872b00). [87487.339011] [<ffffffff9a53d075>] dump_stack+0x68/0xa3 [87487.342198] [<ffffffff99e119a1>] ? console_unlock+0x281/0x6d0 [87487.345364] [<ffffffff99d6b91f>] __warn+0xff/0x140 [87487.348513] [<ffffffff99d6b9aa>] warn_slowpath_fmt+0x4a/0x50 [87487.351659] [<ffffffff9a58b5de>] __list_add+0xae/0x130 [87487.354772] [<ffffffff9add5094>] ? _raw_spin_lock+0x64/0x70 [87487.357915] [<ffffffff99eefd66>] padata_reorder+0x1e6/0x420 [87487.361084] [<ffffffff99ef0055>] padata_do_serial+0xa5/0x120 padata_reorder calls list_add_tail with the list to which its adding locked, which seems correct: spin_lock(&squeue->serial.lock); list_add_tail(&padata->list, &squeue->serial.list); spin_unlock(&squeue->serial.lock); This therefore leaves only place where such inconsistency could occur: if padata->list is added at the same time on two different threads. This pdata pointer comes from the function call to padata_get_next(pd), which has in it the following block: next_queue = per_cpu_ptr(pd->pqueue, cpu); padata = NULL; reorder = &next_queue->reorder; if (!list_empty(&reorder->list)) { padata = list_entry(reorder->list.next, struct padata_priv, list); spin_lock(&reorder->lock); list_del_init(&padata->list); atomic_dec(&pd->reorder_objects); spin_unlock(&reorder->lock); pd->processed++; goto out; } out: return padata; I strongly suspect that the problem here is that two threads can race on reorder list. Even though the deletion is locked, call to list_entry is not locked, which means it's feasible that two threads pick up the same padata object and subsequently call list_add_tail on them at the same time. The fix is thus be hoist that lock outside of that block. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Acked-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2017-06-02blk: improve order of bio handling in generic_make_request()NeilBrown
commit 79bd99596b7305ab08109a8bf44a6a4511dbf1cd upstream. To avoid recursion on the kernel stack when stacked block devices are in use, generic_make_request() will, when called recursively, queue new requests for later handling. They will be handled when the make_request_fn for the current bio completes. If any bios are submitted by a make_request_fn, these will ultimately be handled seqeuntially. If the handling of one of those generates further requests, they will be added to the end of the queue. This strict first-in-first-out behaviour can lead to deadlocks in various ways, normally because a request might need to wait for a previous request to the same device to complete. This can happen when they share a mempool, and can happen due to interdependencies particular to the device. Both md and dm have examples where this happens. These deadlocks can be erradicated by more selective ordering of bios. Specifically by handling them in depth-first order. That is: when the handling of one bio generates one or more further bios, they are handled immediately after the parent, before any siblings of the parent. That way, when generic_make_request() calls make_request_fn for some particular device, we can be certain that all previously submited requests for that device have been completely handled and are not waiting for anything in the queue of requests maintained in generic_make_request(). An easy way to achieve this would be to use a last-in-first-out stack instead of a queue. However this will change the order of consecutive bios submitted by a make_request_fn, which could have unexpected consequences. Instead we take a slightly more complex approach. A fresh queue is created for each call to a make_request_fn. After it completes, any bios for a different device are placed on the front of the main queue, followed by any bios for the same device, followed by all bios that were already on the queue before the make_request_fn was called. This provides the depth-first approach without reordering bios on the same level. This, by itself, it not enough to remove all deadlocks. It just makes it possible for drivers to take the extra step required themselves. To avoid deadlocks, drivers must never risk waiting for a request after submitting one to generic_make_request. This includes never allocing from a mempool twice in the one call to a make_request_fn. A common pattern in drivers is to call bio_split() in a loop, handling the first part and then looping around to possibly split the next part. Instead, a driver that finds it needs to split a bio should queue (with generic_make_request) the second part, handle the first part, and then return. The new code in generic_make_request will ensure the requests to underlying bios are processed first, then the second bio that was split off. If it splits again, the same process happens. In each case one bio will be completely handled before the next one is attempted. With this is place, it should be possible to disable the punt_bios_to_recover() recovery thread for many block devices, and eventually it may be possible to remove it completely. Ref: http://www.spinics.net/lists/raid/msg54680.html Tested-by: Jinpu Wang <jinpu.wang@profitbricks.com> Inspired-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2017-06-02mm: workingset: fix premature shadow node shrinking with cgroupsJohannes Weiner
commit 0cefabdaf757a6455d75f00cb76874e62703ed18 upstream. Commit 0a6b76dd23fa ("mm: workingset: make shadow node shrinker memcg aware") enabled cgroup-awareness in the shadow node shrinker, but forgot to also enable cgroup-awareness in the list_lru the shadow nodes sit on. Consequently, all shadow nodes are sitting on a global (per-NUMA node) list, while the shrinker applies the limits according to the amount of cache in the cgroup its shrinking. The result is excessive pressure on the shadow nodes from cgroups that have very little cache. Enable memcg-mode on the shadow node LRUs, such that per-cgroup limits are applied to per-cgroup lists. Fixes: 0a6b76dd23fa ("mm: workingset: make shadow node shrinker memcg aware") Link: http://lkml.kernel.org/r/20170322005320.8165-1-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vladimir Davydov <vdavydov@tarantool.org> Cc: Michal Hocko <mhocko@suse.com> Cc: <stable@vger.kernel.org> [4.6+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2017-06-02MIPS: Lantiq: Fix cascaded IRQ setupFelix Fietkau
commit 6c356eda225e3ee134ed4176b9ae3a76f793f4dd upstream. With the IRQ stack changes integrated, the XRX200 devices started emitting a constant stream of kernel messages like this: [ 565.415310] Spurious IRQ: CAUSE=0x1100c300 This is caused by IP0 getting handled by plat_irq_dispatch() rather than its vectored interrupt handler, which is fixed by commit de856416e714 ("MIPS: IRQ Stack: Fix erroneous jal to plat_irq_dispatch"). Fix plat_irq_dispatch() to handle non-vectored IPI interrupts correctly by setting up IP2-6 as proper chained IRQ handlers and calling do_IRQ for all MIPS CPU interrupts. Signed-off-by: Felix Fietkau <nbd@nbd.name> Acked-by: John Crispin <john@phrozen.org> Cc: linux-mips@linux-mips.org Patchwork: https://patchwork.linux-mips.org/patch/15077/ [james.hogan@imgtec.com: tweaked commit message] Signed-off-by: James Hogan <james.hogan@imgtec.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2017-06-02ARM: dts: BCM5301X: Correct GIC_PPI interrupt flagsJon Mason
commit 0c2bf9f95983fe30aa2f6463cb761cd42c2d521a upstream. GIC_PPI flags were misconfigured for the timers, resulting in errors like: [ 0.000000] GIC: PPI11 is secure or misconfigured Changing them to being edge triggered corrects the issue Suggested-by: Rafał Miłecki <rafal@milecki.pl> Signed-off-by: Jon Mason <jon.mason@broadcom.com> Fixes: d27509f1 ("ARM: BCM5301X: add dts files for BCM4708 SoC") Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2017-06-02ARM: BCM5301X: Add back handler ignoring external imprecise abortsRafał Miłecki
commit 09f3510fb70a46c8921f2cf4a90dbcae460a6820 upstream. Since early BCM5301X days we got abort handler that was removed by commit 937b12306ea79 ("ARM: BCM5301X: remove workaround imprecise abort fault handler"). It assumed we need to deal only with pending aborts left by the bootloader. Unfortunately this isn't true for BCM5301X. When probing PCI config space (device enumeration) it is expected to have master aborts on the PCI bus. Most bridges don't forward (or they allow disabling it) these errors onto the AXI/AMBA bus but not the Northstar (BCM5301X) one. iProc PCIe controller on Northstar seems to be some older one, without a control register for errors forwarding. It means we need to workaround this at platform level. All newer platforms are not affected by this issue. Signed-off-by: Rafał Miłecki <rafal@milecki.pl> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2017-06-02mm, hugetlb: use pte_present() instead of pmd_present() in follow_huge_pmd()Naoya Horiguchi
commit c9d398fa237882ea07167e23bcfc5e6847066518 upstream. I found the race condition which triggers the following bug when move_pages() and soft offline are called on a single hugetlb page concurrently. Soft offlining page 0x119400 at 0x700000000000 BUG: unable to handle kernel paging request at ffffea0011943820 IP: follow_huge_pmd+0x143/0x190 PGD 7ffd2067 PUD 7ffd1067 PMD 0 [61163.582052] Oops: 0000 [#1] SMP Modules linked in: binfmt_misc ppdev virtio_balloon parport_pc pcspkr i2c_piix4 parport i2c_core acpi_cpufreq ip_tables xfs libcrc32c ata_generic pata_acpi virtio_blk 8139too crc32c_intel ata_piix serio_raw libata virtio_pci 8139cp virtio_ring virtio mii floppy dm_mirror dm_region_hash dm_log dm_mod [last unloaded: cap_check] CPU: 0 PID: 22573 Comm: iterate_numa_mo Tainted: P OE 4.11.0-rc2-mm1+ #2 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 RIP: 0010:follow_huge_pmd+0x143/0x190 RSP: 0018:ffffc90004bdbcd0 EFLAGS: 00010202 RAX: 0000000465003e80 RBX: ffffea0004e34d30 RCX: 00003ffffffff000 RDX: 0000000011943800 RSI: 0000000000080001 RDI: 0000000465003e80 RBP: ffffc90004bdbd18 R08: 0000000000000000 R09: ffff880138d34000 R10: ffffea0004650000 R11: 0000000000c363b0 R12: ffffea0011943800 R13: ffff8801b8d34000 R14: ffffea0000000000 R15: 000077ff80000000 FS: 00007fc977710740(0000) GS:ffff88007dc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffea0011943820 CR3: 000000007a746000 CR4: 00000000001406f0 Call Trace: follow_page_mask+0x270/0x550 SYSC_move_pages+0x4ea/0x8f0 SyS_move_pages+0xe/0x10 do_syscall_64+0x67/0x180 entry_SYSCALL64_slow_path+0x25/0x25 RIP: 0033:0x7fc976e03949 RSP: 002b:00007ffe72221d88 EFLAGS: 00000246 ORIG_RAX: 0000000000000117 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fc976e03949 RDX: 0000000000c22390 RSI: 0000000000001400 RDI: 0000000000005827 RBP: 00007ffe72221e00 R08: 0000000000c2c3a0 R09: 0000000000000004 R10: 0000000000c363b0 R11: 0000000000000246 R12: 0000000000400650 R13: 00007ffe72221ee0 R14: 0000000000000000 R15: 0000000000000000 Code: 81 e4 ff ff 1f 00 48 21 c2 49 c1 ec 0c 48 c1 ea 0c 4c 01 e2 49 bc 00 00 00 00 00 ea ff ff 48 c1 e2 06 49 01 d4 f6 45 bc 04 74 90 <49> 8b 7c 24 20 40 f6 c7 01 75 2b 4c 89 e7 8b 47 1c 85 c0 7e 2a RIP: follow_huge_pmd+0x143/0x190 RSP: ffffc90004bdbcd0 CR2: ffffea0011943820 ---[ end trace e4f81353a2d23232 ]--- Kernel panic - not syncing: Fatal exception Kernel Offset: disabled This bug is triggered when pmd_present() returns true for non-present hugetlb, so fixing the present check in follow_huge_pmd() prevents it. Using pmd_present() to determine present/non-present for hugetlb is not correct, because pmd_present() checks multiple bits (not only _PAGE_PRESENT) for historical reason and it can misjudge hugetlb state. Fixes: e66f17ff7177 ("mm/hugetlb: take page table lock in follow_huge_pmd()") Link: http://lkml.kernel.org/r/1490149898-20231-1-git-send-email-n-horiguchi@ah.jp.nec.com Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Hugh Dickins <hughd@google.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: <stable@vger.kernel.org> [4.0+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2017-06-02mm: rmap: fix huge file mmap accounting in the memcg statsJohannes Weiner
commit 553af430e7c981e6e8fa5007c5b7b5773acc63dd upstream. Huge pages are accounted as single units in the memcg's "file_mapped" counter. Account the correct number of base pages, like we do in the corresponding node counter. Link: http://lkml.kernel.org/r/20170322005111.3156-1-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: <stable@vger.kernel.org> [4.8+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2017-06-02x86/mce: Fix copy/paste error in exception table entriesTony Luck
commit 26a37ab319a26d330bab298770d692bb9c852aff upstream. Back in commit: 92b0729c34cab ("x86/mm, x86/mce: Add memcpy_mcsafe()") ... I made a copy/paste error setting up the exception table entries and ended up with two for label .L_cache_w3 and none for .L_cache_w2. This means that if we take a machine check on: .L_cache_w2: movq 2*8(%rsi), %r10 then we don't have an exception table entry for this instruction and we can't recover. Fix: s/3/2/ Signed-off-by: Tony Luck <tony.luck@intel.com> Cc: <stable@vger.kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: 92b0729c34cab ("x86/mm, x86/mce: Add memcpy_mcsafe()") Link: http://lkml.kernel.org/r/1490046030-25862-1-git-send-email-tony.luck@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2017-06-02x86/mm/KASLR: Exclude EFI region from KASLR VA space randomizationBaoquan He
commit a46f60d76004965e5669dbf3fc21ef3bc3632eb4 upstream. Currently KASLR is enabled on three regions: the direct mapping of physical memory, vamlloc and vmemmap. However the EFI region is also mistakenly included for VA space randomization because of misusing EFI_VA_START macro and assuming EFI_VA_START < EFI_VA_END. (This breaks kexec and possibly other things that rely on stable addresses.) The EFI region is reserved for EFI runtime services virtual mapping which should not be included in KASLR ranges. In Documentation/x86/x86_64/mm.txt, we can see: ffffffef00000000 - fffffffeffffffff (=64 GB) EFI region mapping space EFI uses the space from -4G to -64G thus EFI_VA_START > EFI_VA_END, Here EFI_VA_START = -4G, and EFI_VA_END = -64G. Changing EFI_VA_START to EFI_VA_END in mm/kaslr.c fixes this problem. Signed-off-by: Baoquan He <bhe@redhat.com> Reviewed-by: Bhupesh Sharma <bhsharma@redhat.com> Acked-by: Dave Young <dyoung@redhat.com> Acked-by: Thomas Garnier <thgarnie@google.com> Cc: <stable@vger.kernel.org> #4.8+ Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Masahiro Yamada <yamada.masahiro@socionext.com> Cc: Matt Fleming <matt@codeblueprint.co.uk> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-efi@vger.kernel.org Link: http://lkml.kernel.org/r/1490331592-31860-1-git-send-email-bhe@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2017-06-02x86/mm/64: Enable KASLR for vmemmap memory regionThomas Garnier
commit 25dfe4785332723f09311dcb7fd91015a60c022f upstream. Add vmemmap in the list of randomized memory regions. The vmemmap region holds a representation of the physical memory (through a struct page array). An attacker could use this region to disclose the kernel memory layout (walking the page linked list). Signed-off-by: Thomas Garnier <thgarnie@google.com> Signed-off-by: Kees Cook <keescook@chromium.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: kernel-hardening@lists.openwall.com Link: http://lkml.kernel.org/r/1469635196-122447-1-git-send-email-thgarnie@google.com [ Minor edits. ] Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2017-06-02drm/etnaviv: (re-)protect fence allocation with GPU mutexLucas Stach
commit f3cd1b064f1179d9e6188c6d67297a2360880e10 upstream. The fence allocation needs to be protected by the GPU mutex, otherwise the fence seqnos of concurrent submits might not match the insertion order of the jobs in the kernel ring. This breaks the assumption that jobs complete with monotonically increasing fence seqnos. Fixes: d9853490176c (drm/etnaviv: take GPU lock later in the submit process) CC: stable@vger.kernel.org #4.9+ Signed-off-by: Lucas Stach <l.stach@pengutronix.de> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2017-06-02drm/vc4: Allocate the right amount of space for boot-time CRTC state.Eric Anholt
commit 6d6e500391875cc372336c88e9a8af377be19c36 upstream. Without this, the first modeset would dereference past the allocation when trying to free the mm node. Signed-off-by: Eric Anholt <eric@anholt.net> Tested-by: Stefan Wahren <stefan.wahren@i2se.com> Link: http://patchwork.freedesktop.org/patch/msgid/20170328201343.4884-1-eric@anholt.net Fixes: d8dbf44f13b9 ("drm/vc4: Make the CRTCs cooperate on allocating display lists.") Cc: <stable@vger.kernel.org> # v4.6+ Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2017-06-02drm/radeon: Override fpfn for all VRAM placements in radeon_evict_flagsMichel Dänzer
commit ce4b4f228e51219b0b79588caf73225b08b5b779 upstream. We were accidentally only overriding the first VRAM placement. For BOs with the RADEON_GEM_NO_CPU_ACCESS flag set, radeon_ttm_placement_from_domain creates a second VRAM placment with fpfn == 0. If VRAM is almost full, the first VRAM placement with fpfn > 0 may not work, but the second one with fpfn == 0 always will (the BO's current location trivially satisfies it). Because "moving" the BO to its current location puts it back on the LRU list, this results in an infinite loop. Fixes: 2a85aedd117c ("drm/radeon: Try evicting from CPU accessible to inaccessible VRAM first") Reported-by: Zachary Michaels <zmichaels@oblong.com> Reported-and-Tested-by: Julien Isorce <jisorce@oblong.com> Reviewed-by: Christian König <christian.koenig@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Michel Dänzer <michel.daenzer@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2017-06-02KVM: kvm_io_bus_unregister_dev() should never failDavid Hildenbrand
commit 90db10434b163e46da413d34db8d0e77404cc645 upstream. No caller currently checks the return value of kvm_io_bus_unregister_dev(). This is evil, as all callers silently go on freeing their device. A stale reference will remain in the io_bus, getting at least used again, when the iobus gets teared down on kvm_destroy_vm() - leading to use after free errors. There is nothing the callers could do, except retrying over and over again. So let's simply remove the bus altogether, print an error and make sure no one can access this broken bus again (returning -ENOMEM on any attempt to access it). Fixes: e93f8a0f821e ("KVM: convert io_bus to SRCU") Cc: stable@vger.kernel.org # 3.4+ Reported-by: Dmitry Vyukov <dvyukov@google.com> Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com> Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2017-06-02KVM: x86: clear bus pointer when destroyedPeter Xu
commit df630b8c1e851b5e265dc2ca9c87222e342c093b upstream. When releasing the bus, let's clear the bus pointers to mark it out. If any further device unregister happens on this bus, we know that we're done if we found the bus being released already. Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2017-06-02serial: mxs-auart: Fix baudrate calculationUwe Kleine-König
commit a6040bc610554c66088fda3608ae5d6307c548e4 upstream. The reference manual for the i.MX28 recommends to calculate the divisor as divisor = (UARTCLK * 32) / baud rate, rounded to the nearest integer , so let's do this. For a typical setup of UARTCLK = 24 MHz and baud rate = 115200 this changes the divisor from 6666 to 6667 and so the actual baud rate improves from 115211.521 Bd (error ≅ 0.01 %) to 115194.240 Bd (error ≅ 0.005 %). Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de> Cc: stable <stable@vger.kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2017-06-02USB: fix linked-list corruption in rh_call_control()Alan Stern
commit 1633682053a7ee8058e10c76722b9b28e97fb73f upstream. Using KASAN, Dmitry found a bug in the rh_call_control() routine: If buffer allocation fails, the routine returns immediately without unlinking its URB from the control endpoint, eventually leading to linked-list corruption. This patch fixes the problem by jumping to the end of the routine (where the URB is unlinked) when an allocation failure occurs. Signed-off-by: Alan Stern <stern@rowland.harvard.edu> Reported-and-tested-by: Dmitry Vyukov <dvyukov@google.com> CC: <stable@vger.kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2017-06-02tty/serial: atmel: fix race condition (TX+DMA)Richard Genoud
commit 31ca2c63fdc0aee725cbd4f207c1256f5deaabde upstream. If uart_flush_buffer() is called between atmel_tx_dma() and atmel_complete_tx_dma(), the circular buffer has been cleared, but not atmel_port->tx_len. That leads to a circular buffer overflow (dumping (UART_XMIT_SIZE - atmel_port->tx_len) bytes). Tested-by: Nicolas Ferre <nicolas.ferre@microchip.com> Signed-off-by: Richard Genoud <richard.genoud@gmail.com> Cc: stable <stable@vger.kernel.org> # 3.12+ Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2017-06-02ACPI: Do not create a platform_device for IOAPIC/IOxAPICJoerg Roedel
commit 08f63d97749185fab942a3a47ed80f5bd89b8b7d upstream. No platform-device is required for IO(x)APICs, so don't even create them. [ rjw: This fixes a problem with leaking platform device objects after IOAPIC/IOxAPIC hot-removal events.] Signed-off-by: Joerg Roedel <jroedel@suse.de> Cc: All applicable <stable@vger.kernel.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2017-06-02ACPI: Fix incompatibility with mcount-based function graph tracingJosh Poimboeuf
commit 61b79e16c68d703dde58c25d3935d67210b7d71b upstream. Paul Menzel reported a warning: WARNING: CPU: 0 PID: 774 at /build/linux-ROBWaj/linux-4.9.13/kernel/trace/trace_functions_graph.c:233 ftrace_return_to_handler+0x1aa/0x1e0 Bad frame pointer: expected f6919d98, received f6919db0 from func acpi_pm_device_sleep_wake return to c43b6f9d The warning means that function graph tracing is broken for the acpi_pm_device_sleep_wake() function. That's because the ACPI Makefile unconditionally sets the '-Os' gcc flag to optimize for size. That's an issue because mcount-based function graph tracing is incompatible with '-Os' on x86, thanks to the following gcc bug: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=42109 I have another patch pending which will ensure that mcount-based function graph tracing is never used with CONFIG_CC_OPTIMIZE_FOR_SIZE on x86. But this patch is needed in addition to that one because the ACPI Makefile overrides that config option for no apparent reason. It has had this flag since the beginning of git history, and there's no related comment, so I don't know why it's there. As far as I can tell, there's no reason for it to be there. The appropriate behavior is for it to honor CONFIG_CC_OPTIMIZE_FOR_{SIZE,PERFORMANCE} like the rest of the kernel. Reported-by: Paul Menzel <pmenzel@molgen.mpg.de> Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: All applicable <stable@vger.kernel.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2017-06-02nfsd: map the ENOKEY to nfserr_perm for avoiding warningKinglong Mee
commit c952cd4e949ab3d07287efc2e80246e03727d15d upstream. Now that Ext4 and f2fs filesystems support encrypted directories and files, attempts to access those files may return ENOKEY, resulting in the following WARNING. Map ENOKEY to nfserr_perm instead of nfserr_io. [ 1295.411759] ------------[ cut here ]------------ [ 1295.411787] WARNING: CPU: 0 PID: 12786 at fs/nfsd/nfsproc.c:796 nfserrno+0x74/0x80 [nfsd] [ 1295.411806] nfsd: non-standard errno: -126 [ 1295.411816] Modules linked in: nfsd nfs_acl auth_rpcgss nfsv4 nfs lockd fscache tun bridge stp llc fuse ip_set nfnetlink vmw_vsock_vmci_transport vsock snd_seq_midi snd_seq_midi_event coretemp crct10dif_pclmul crc32_generic crc32_pclmul snd_ens1371 gameport ghash_clmulni_intel snd_ac97_codec f2fs intel_rapl_perf ac97_bus snd_seq ppdev snd_pcm snd_rawmidi snd_timer vmw_balloon snd_seq_device snd joydev soundcore parport_pc parport nfit acpi_cpufreq tpm_tis vmw_vmci tpm_tis_core tpm shpchp i2c_piix4 grace sunrpc xfs libcrc32c vmwgfx drm_kms_helper ttm drm crc32c_intel e1000 mptspi scsi_transport_spi serio_raw mptscsih mptbase ata_generic pata_acpi fjes [last unloaded: nfs_acl] [ 1295.412522] CPU: 0 PID: 12786 Comm: nfsd Tainted: G W 4.11.0-rc1+ #521 [ 1295.412959] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015 [ 1295.413814] Call Trace: [ 1295.414252] dump_stack+0x63/0x86 [ 1295.414666] __warn+0xcb/0xf0 [ 1295.415087] warn_slowpath_fmt+0x5f/0x80 [ 1295.415502] ? put_filp+0x42/0x50 [ 1295.415927] nfserrno+0x74/0x80 [nfsd] [ 1295.416339] nfsd_open+0xd7/0x180 [nfsd] [ 1295.416746] nfs4_get_vfs_file+0x367/0x3c0 [nfsd] [ 1295.417182] ? security_inode_permission+0x41/0x60 [ 1295.417591] nfsd4_process_open2+0x9b2/0x1200 [nfsd] [ 1295.418007] nfsd4_open+0x481/0x790 [nfsd] [ 1295.418409] nfsd4_proc_compound+0x395/0x680 [nfsd] [ 1295.418812] nfsd_dispatch+0xb8/0x1f0 [nfsd] [ 1295.419233] svc_process_common+0x4d9/0x830 [sunrpc] [ 1295.419631] svc_process+0xfe/0x1b0 [sunrpc] [ 1295.420033] nfsd+0xe9/0x150 [nfsd] [ 1295.420420] kthread+0x101/0x140 [ 1295.420802] ? nfsd_destroy+0x60/0x60 [nfsd] [ 1295.421199] ? kthread_park+0x90/0x90 [ 1295.421598] ret_from_fork+0x2c/0x40 [ 1295.421996] ---[ end trace 0d5a969cd7852e1f ]--- Signed-off-by: Kinglong Mee <kinglongmee@gmail.com> Cc: stable@vger.kernel.org Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2017-06-02mmc: sdhci-of-at91: fix MMC_DDR_52 timing selectionLudovic Desroches
commit d0918764c17b94c30bbb2619929b1719ff52707a upstream. The controller has different timings for MMC_TIMING_UHS_DDR50 and MMC_TIMING_MMC_DDR52. Configuring the controller with SDHCI_CTRL_UHS_DDR50, when MMC_TIMING_MMC_DDR52 timings are requested, is not correct and can lead to unexpected behavior. Signed-off-by: Ludovic Desroches <ludovic.desroches@microchip.com> Fixes: bb5f8ea4d514 ("mmc: sdhci-of-at91: introduce driver for the Atmel SDMMC") Cc: <stable@vger.kernel.org> # 4.4+ Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2017-06-02mmc: sdhci: Disable runtime pm when the sdio_irq is enabledHans de Goede
commit 923713b357455cfb9aca2cd3429cb0806a724ed2 upstream. SDIO cards may need clock to send the card interrupt to the host. On a cherrytrail tablet with a RTL8723BS wifi chip, without this patch pinging the tablet results in: PING 192.168.1.14 (192.168.1.14) 56(84) bytes of data. 64 bytes from 192.168.1.14: icmp_seq=1 ttl=64 time=78.6 ms 64 bytes from 192.168.1.14: icmp_seq=2 ttl=64 time=1760 ms 64 bytes from 192.168.1.14: icmp_seq=3 ttl=64 time=753 ms 64 bytes from 192.168.1.14: icmp_seq=4 ttl=64 time=3.88 ms 64 bytes from 192.168.1.14: icmp_seq=5 ttl=64 time=795 ms 64 bytes from 192.168.1.14: icmp_seq=6 ttl=64 time=1841 ms 64 bytes from 192.168.1.14: icmp_seq=7 ttl=64 time=810 ms 64 bytes from 192.168.1.14: icmp_seq=8 ttl=64 time=1860 ms 64 bytes from 192.168.1.14: icmp_seq=9 ttl=64 time=812 ms 64 bytes from 192.168.1.14: icmp_seq=10 ttl=64 time=48.6 ms Where as with this patch I get: PING 192.168.1.14 (192.168.1.14) 56(84) bytes of data. 64 bytes from 192.168.1.14: icmp_seq=1 ttl=64 time=3.96 ms 64 bytes from 192.168.1.14: icmp_seq=2 ttl=64 time=1.97 ms 64 bytes from 192.168.1.14: icmp_seq=3 ttl=64 time=17.2 ms 64 bytes from 192.168.1.14: icmp_seq=4 ttl=64 time=2.46 ms 64 bytes from 192.168.1.14: icmp_seq=5 ttl=64 time=2.83 ms 64 bytes from 192.168.1.14: icmp_seq=6 ttl=64 time=1.40 ms 64 bytes from 192.168.1.14: icmp_seq=7 ttl=64 time=2.10 ms 64 bytes from 192.168.1.14: icmp_seq=8 ttl=64 time=1.40 ms 64 bytes from 192.168.1.14: icmp_seq=9 ttl=64 time=2.04 ms 64 bytes from 192.168.1.14: icmp_seq=10 ttl=64 time=1.40 ms Cc: Dong Aisheng <b29396@freescale.com> Cc: Ian W MORRISON <ianwmorrison@gmail.com> Signed-off-by: Hans de Goede <hdegoede@redhat.com> Acked-by: Adrian Hunter <adrian.hunter@intel.com> Acked-by: Dong Aisheng <aisheng.dong@nxp.com> Cc: stable@vger.kernel.org Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2017-06-02ASoC: Intel: Skylake: fix invalid memory access due to wrong reference of ↵Takashi Sakamoto
pointer commit d1a6fe41d3c4ff0d26f0b186d774493555ca5282 upstream. In 'skl_tplg_set_module_init_data()', a pointer to 'params' member of 'struct skl_algo_data' is calculated, then casted to (u32 *) and assigned to a member of configuration data. The configuration data is passed to the other functions and used to process intel IPC. In this processing, the value of member is used to get message data, however this can bring invalid memory access in 'skl_set_module_params()' as a result of calculation of a pointer for actual message data. (sound/soc/intel/skylake/skl-topology.c) skl_tplg_init_pipe_modules() ->skl_tplg_set_module_init_data() (has this bug) ->skl_tplg_set_module_params() (sound/soc/intel/skylake/skl-messages.c) ->skl_set_module_params() ((char *)param) + data_offset This commit fixes the bug. Fixes: abb740033b56 ("ASoC: Intel: Skylake: Add support to configure module params") Signed-off-by: Takashi Sakamoto <takashi.sakamoto@miraclelinux.com> Acked-by: Vinod Koul <vinod.koul@intel.com> Signed-off-by: Mark Brown <broonie@kernel.org> Cc: <stable@vger.kernel.org> # v4.5+ Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2017-06-02ASoC: atmel-classd: fix audio clock rateSongjun Wu
commit cd3ac9affc43b44f49d7af70d275f0bd426ba643 upstream. Fix the audio clock rate according to the datasheet. Reported-by: Dushara Jayasinghe <dushara@successful.com.au> Signed-off-by: Songjun Wu <songjun.wu@microchip.com> Acked-by: Nicolas Ferre <nicolas.ferre@microchip.com> Signed-off-by: Mark Brown <broonie@kernel.org> Cc: stable@vger.kernel.org Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2017-06-02ALSA: hda - fix a problem for lineout on a Dell AIO machineHui Wang
commit 2f726aec19a9d2c63bec9a8a53a3910ffdcd09f8 upstream. On this Dell AIO machine, the lineout jack does not work. We found the pin 0x1a is assigned to lineout on this machine, and in the past, we applied ALC298_FIXUP_DELL1_MIC_NO_PRESENCE to fix the heaset-set mic problem for this machine, this fixup will redefine the pin 0x1a to headphone-mic, as a result the lineout doesn't work anymore. After consulting with Dell, they told us this machine doesn't support microphone via headset jack, so we add a new fixup which only defines the pin 0x18 as the headset-mic. [rearranged the fixup insertion position by tiwai in order to make the merge with other branches easier -- tiwai] Fixes: 59ec4b57bcae ("ALSA: hda - Fix headset mic detection problem for two dell machines") Cc: <stable@vger.kernel.org> Signed-off-by: Hui Wang <hui.wang@canonical.com> Signed-off-by: Takashi Iwai <tiwai@suse.de> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2017-06-02ALSA: seq: Fix race during FIFO resizeTakashi Iwai
commit 2d7d54002e396c180db0c800c1046f0a3c471597 upstream. When a new event is queued while processing to resize the FIFO in snd_seq_fifo_clear(), it may lead to a use-after-free, as the old pool that is being queued gets removed. For avoiding this race, we need to close the pool to be deleted and sync its usage before actually deleting it. The issue was spotted by syzkaller. Reported-by: Dmitry Vyukov <dvyukov@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Takashi Iwai <tiwai@suse.de> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2017-06-02PCI: iproc: Save host bridge window resource in struct iproc_pcieBjorn Helgaas
commit 6e347b5e05ea2ac4ac467a5a1cfaebb2c7f06f80 upstream. The host bridge memory window resource is inserted into the iomem_resource tree and cannot be deallocated until the host bridge itself is removed. Previously, the window was on the stack, which meant the iomem_resource entry pointed into the stack and was corrupted as soon as the probe function returned, which caused memory corruption and errors like this: pcie_iproc_bcma bcma0:8: resource collision: [mem 0x40000000-0x47ffffff] conflicts with PCIe MEM space [mem 0x40000000-0x47ffffff] Move the memory window resource from the stack into struct iproc_pcie so its lifetime matches that of the host bridge. Fixes: c3245a566400 ("PCI: iproc: Request host bridge window resources") Reported-and-tested-by: Rafał Miłecki <zajec5@gmail.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> CC: stable@vger.kernel.org # v4.8+ Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2017-06-02scsi: scsi_dh_alua: Ensure that alua_activate() calls the completion functionBart Van Assche
commit 7cb689fe42927281b8d98606ae5450173fcc66a9 upstream. Callers of scsi_dh_activate(), e.g. dm-mpath, assume that this function either returns an error code or calls the completion function. Make alua_activate() call the completion function even if scsi_device_get() fails. Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Cc: Hannes Reinecke <hare@suse.de> Cc: Tang Junhui <tang.junhui@zte.com.cn> Cc: <stable@vger.kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2017-06-02scsi: scsi_dh_alua: Check scsi_device_get() return valueBart Van Assche
commit 625fe857e4fac6518716f3c0ff5e5deb8ec6d238 upstream. Do not queue ALUA work nor call scsi_device_put() if the scsi_device_get() call fails. This patch fixes the following crash: general protection fault: 0000 [#1] SMP RIP: 0010:scsi_device_put+0xb/0x30 Call Trace: scsi_disk_put+0x2d/0x40 sd_release+0x3d/0xb0 __blkdev_put+0x29e/0x360 blkdev_put+0x49/0x170 dm_put_table_device+0x58/0xc0 [dm_mod] dm_put_device+0x70/0xc0 [dm_mod] free_priority_group+0x92/0xc0 [dm_multipath] free_multipath+0x70/0xc0 [dm_multipath] multipath_dtr+0x19/0x20 [dm_multipath] dm_table_destroy+0x67/0x120 [dm_mod] dev_suspend+0xde/0x240 [dm_mod] ctl_ioctl+0x1f5/0x520 [dm_mod] dm_ctl_ioctl+0xe/0x20 [dm_mod] do_vfs_ioctl+0x8f/0x700 SyS_ioctl+0x3c/0x70 entry_SYSCALL_64_fastpath+0x18/0xad Fixes: commit 03197b61c5ec ("scsi_dh_alua: Use workqueue for RTPG") Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Cc: Hannes Reinecke <hare@suse.de> Cc: Tang Junhui <tang.junhui@zte.com.cn> Cc: <stable@vger.kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2017-06-02scsi: libsas: fix ata xfer lengthJohn Garry
commit 9702c67c6066f583b629cf037d2056245bb7a8e6 upstream. The total ata xfer length may not be calculated properly, in that we do not use the proper method to get an sg element dma length. According to the code comment, sg_dma_len() should be used after dma_map_sg() is called. This issue was found by turning on the SMMUv3 in front of the hisi_sas controller in hip07. Multiple sg elements were being combined into a single element, but the original first element length was being use as the total xfer length. Cc: <stable@vger.kernel.org> Fixes: ff2aeb1eb64c8a4770a6 ("libata: convert to chained sg") Signed-off-by: John Garry <john.garry@huawei.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2017-06-02scsi: sg: check length passed to SG_NEXT_CMD_LENpeter chang
commit bf33f87dd04c371ea33feb821b60d63d754e3124 upstream. The user can control the size of the next command passed along, but the value passed to the ioctl isn't checked against the usable max command size. Cc: <stable@vger.kernel.org> Signed-off-by: Peter Chang <dpf@google.com> Acked-by: Douglas Gilbert <dgilbert@interlog.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2017-06-02xfs: Use xfs_icluster_size_fsb() to calculate inode alignment maskChandan Rajendra
commit d5825712ee98d68a2c17bc89dad2c30276894cba upstream. When block size is larger than inode cluster size, the call to XFS_B_TO_FSBT(mp, mp->m_inode_cluster_size) returns 0. Also, mkfs.xfs would have set xfs_sb->sb_inoalignmt to 0. Hence in xfs_set_inoalignment(), xfs_mount->m_inoalign_mask gets initialized to -1 instead of 0. However, xfs_mount->m_sinoalign would get correctly intialized to 0 because for every positive value of xfs_mount->m_dalign, the condition "!(mp->m_dalign & mp->m_inoalign_mask)" would evaluate to false. Also, xfs_imap() worked fine even with xfs_mount->m_inoalign_mask having -1 as the value because blks_per_cluster variable would have the value 1 and hence we would never have a need to use xfs_mount->m_inoalign_mask to compute the inode chunk's agbno and offset within the chunk. Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2017-06-02xfs: tune down agno asserts in the bmap codeChristoph Hellwig
commit 410d17f67e583559be3a922f8b6cc336331893f3 upstream. In various places we currently assert that xfs_bmap_btalloc allocates from the same as the firstblock value passed in, unless it's either NULLAGNO or the dop_low flag is set. But the reflink code does not fully follow this convention as it passes in firstblock purely as a hint for the allocator without actually having previous allocations in the transaction, and without having a minleft check on the current AG, leading to the assert firing on a very full and heavily used file system. As even the reflink code only allocates from equal or higher AGs for now we can simply the check to always allow for equal or higher AGs. Note that we need to eventually split the two meanings of the firstblock value. At that point we can also allow the reflink code to allocate from any AG instead of limiting it in any way. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2017-06-02xfs: Use xfs_icluster_size_fsb() to calculate inode chunk alignmentChandan Rajendra
commit 8ee9fdbebc84b39f1d1c201c5e32277c61d034aa upstream. On a ppc64 system, executing generic/256 test with 32k block size gives the following call trace, XFS: Assertion failed: args->maxlen > 0, file: /root/repos/linux/fs/xfs/libxfs/xfs_alloc.c, line: 2026 kernel BUG at /root/repos/linux/fs/xfs/xfs_message.c:113! Oops: Exception in kernel mode, sig: 5 [#1] SMP NR_CPUS=2048 DEBUG_PAGEALLOC NUMA pSeries Modules linked in: CPU: 2 PID: 19361 Comm: mkdir Not tainted 4.10.0-rc5 #58 task: c000000102606d80 task.stack: c0000001026b8000 NIP: c0000000004ef798 LR: c0000000004ef798 CTR: c00000000082b290 REGS: c0000001026bb090 TRAP: 0700 Not tainted (4.10.0-rc5) MSR: 8000000000029032 <SF,EE,ME,IR,DR,RI> CR: 28004428 XER: 00000000 CFAR: c0000000004ef180 SOFTE: 1 GPR00: c0000000004ef798 c0000001026bb310 c000000001157300 ffffffffffffffea GPR04: 000000000000000a c0000001026bb130 0000000000000000 ffffffffffffffc0 GPR08: 00000000000000d1 0000000000000021 00000000ffffffd1 c000000000dd4990 GPR12: 0000000022004444 c00000000fe00800 0000000020000000 0000000000000000 GPR16: 0000000000000000 0000000043a606fc 0000000043a76c08 0000000043a1b3d0 GPR20: 000001002a35cd60 c0000001026bbb80 0000000000000000 0000000000000001 GPR24: 0000000000000240 0000000000000004 c00000062dc55000 0000000000000000 GPR28: 0000000000000004 c00000062ecd9200 0000000000000000 c0000001026bb6c0 NIP [c0000000004ef798] .assfail+0x28/0x30 LR [c0000000004ef798] .assfail+0x28/0x30 Call Trace: [c0000001026bb310] [c0000000004ef798] .assfail+0x28/0x30 (unreliable) [c0000001026bb380] [c000000000455d74] .xfs_alloc_space_available+0x194/0x1b0 [c0000001026bb410] [c00000000045b914] .xfs_alloc_fix_freelist+0x144/0x480 [c0000001026bb580] [c00000000045c368] .xfs_alloc_vextent+0x698/0xa90 [c0000001026bb650] [c0000000004a6200] .xfs_ialloc_ag_alloc+0x170/0x820 [c0000001026bb7c0] [c0000000004a9098] .xfs_dialloc+0x158/0x320 [c0000001026bb8a0] [c0000000004e628c] .xfs_ialloc+0x7c/0x610 [c0000001026bb990] [c0000000004e8138] .xfs_dir_ialloc+0xa8/0x2f0 [c0000001026bbaa0] [c0000000004e8814] .xfs_create+0x494/0x790 [c0000001026bbbf0] [c0000000004e5ebc] .xfs_generic_create+0x2bc/0x410 [c0000001026bbce0] [c0000000002b4a34] .vfs_mkdir+0x154/0x230 [c0000001026bbd70] [c0000000002bc444] .SyS_mkdirat+0x94/0x120 [c0000001026bbe30] [c00000000000b760] system_call+0x38/0xfc Instruction dump: 4e800020 60000000 7c0802a6 7c862378 3c82ffca 7ca72b78 38841c18 7c651b78 38600000 f8010010 f821ff91 4bfff94d <0fe00000> 60000000 7c0802a6 7c892378 When block size is larger than inode cluster size, the call to XFS_B_TO_FSBT(mp, mp->m_inode_cluster_size) returns 0. Also, mkfs.xfs would have set xfs_sb->sb_inoalignmt to 0. This causes xfs_ialloc_cluster_alignment() to return 0. Due to this args.minalignslop (in xfs_ialloc_ag_alloc()) gets the unsigned equivalent of -1 assigned to it. This later causes alloc_len in xfs_alloc_space_available() to have a value of 0. In such a scenario when args.total is also 0, the assert statement "ASSERT(args->maxlen > 0);" fails. This commit fixes the bug by replacing the call to XFS_B_TO_FSBT() in xfs_ialloc_cluster_alignment() with a call to xfs_icluster_size_fsb(). Suggested-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2017-06-02xfs: don't reserve blocks for right shift transactionsBrian Foster
commit 48af96ab92bc68fb645068b978ce36df2379e076 upstream. The block reservation for the transaction allocated in xfs_shift_file_space() is an artifact of the original collapse range support. It exists to handle the case where a collapse range occurs, the initial extent is left shifted into a location that forms a contiguous boundary with the previous extent and thus the extents are merged. This code was subsequently refactored and reused for insert range (right shift) support. If an insert range occurs under low free space conditions, the extent at the starting offset is split before the first shift transaction is allocated. If the block reservation fails, this leaves separate, but contiguous extents around in the inode. While not a fatal problem, this is unexpected and will flag a warning on subsequent insert range operations on the inode. This problem has been reproduce intermittently by generic/270 running against a ramdisk device. Since right shift does not create new extent boundaries in the inode, a block reservation for extent merge is unnecessary. Update xfs_shift_file_space() to conditionally reserve fs blocks for left shift transactions only. This avoids the warning reproduced by generic/270. Reported-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2017-06-02xfs: split indlen reservations fairly when under reservedBrian Foster
commit 75d65361cf3c0dae2af970c305e19c727b28a510 upstream. Certain workoads that punch holes into speculative preallocation can cause delalloc indirect reservation splits when the delalloc extent is split in two. If further splits occur, an already short-handed extent can be split into two in a manner that leaves zero indirect blocks for one of the two new extents. This occurs because the shortage is large enough that the xfs_bmap_split_indlen() algorithm completely drains the requested indlen of one of the extents before it honors the existing reservation. This ultimately results in a warning from xfs_bmap_del_extent(). This has been observed during file copies of large, sparse files using 'cp --sparse=always.' To avoid this problem, update xfs_bmap_split_indlen() to explicitly apply the reservation shortage fairly between both extents. This smooths out the overall indlen shortage and defers the situation where we end up with a delalloc extent with zero indlen reservation to extreme circumstances. Reported-by: Patrick Dung <mpatdung@gmail.com> Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2017-06-02xfs: handle indlen shortage on delalloc extent mergeBrian Foster
commit 0e339ef8556d9e567aa7925f8892c263d79430d9 upstream. When a delalloc extent is created, it can be merged with pre-existing, contiguous, delalloc extents. When this occurs, xfs_bmap_add_extent_hole_delay() merges the extents along with the associated indirect block reservations. The expectation here is that the combined worst case indlen reservation is always less than or equal to the indlen reservation for the individual extents. This is not always the case, however, as existing extents can less than the expected indlen reservation if the extent was previously split due to a hole punch. If a new extent merges with such an extent, the total indlen requirement may be larger than the sum of the indlen reservations held by both extents. xfs_bmap_add_extent_hole_delay() assumes that the worst case indlen reservation is always available and assigns it to the merged extent without consideration for the indlen held by the pre-existing extent. As a result, the subsequent xfs_mod_fdblocks() call can attempt an unintentional allocation rather than a free (indicated by an ASSERT() failure). Further, if the allocation happens to fail in this context, the failure goes unhandled and creates a filesystem wide block accounting inconsistency. Fix xfs_bmap_add_extent_hole_delay() to function as designed. Cap the indlen reservation assigned to the merged extent to the sum of the indlen reservations held by each of the individual extents. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2017-06-02xfs: don't fail xfs_extent_busy allocationChristoph Hellwig
commit 5e30c23d13919a718b22d4921dc5c0accc59da27 upstream. We don't just need the structure to track busy extents which can be avoided with a synchronous transaction, but also to keep track of pending discard. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2017-06-02xfs: reset b_first_retry_time when clear the retry status of xfs_buf_tHou Tao
commit 4dd2eb633598cb6a5a0be2fd9a2be0819f5eeb5f upstream. After successful IO or permanent error, b_first_retry_time also needs to be cleared, else the invalid first retry time will be used by the next retry check. Signed-off-by: Hou Tao <houtao1@huawei.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2017-06-02xfs: verify free block header fieldsDarrick J. Wong
commit de14c5f541e78c59006bee56f6c5c2ef1ca07272 upstream. Perform basic sanity checking of the directory free block header fields so that we avoid hanging the system on invalid data. (Granted that just means that now we shutdown on directory write, but that seems better than hanging...) Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2017-06-02xfs: check for obviously bad level values in the bmbt rootDarrick J. Wong
commit b3bf607d58520ea8c0666aeb4be60dbb724cd3a2 upstream. We can't handle a bmbt that's taller than BTREE_MAXLEVELS, and there's no such thing as a zero-level bmbt (for that we have extents format), so if we see this, send back an error code. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2017-06-02xfs: filter out obviously bad btree pointersDarrick J. Wong
commit d5a91baeb6033c3392121e4d5c011cdc08dfa9f7 upstream. Don't let anybody load an obviously bad btree pointer. Since the values come from disk, we must return an error, not just ASSERT. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2017-06-02xfs: fail _dir_open when readahead failsDarrick J. Wong
commit 7a652bbe366464267190c2792a32ce4fff5595ef upstream. When we open a directory, we try to readahead block 0 of the directory on the assumption that we're going to need it soon. If the bmbt is corrupt, the directory will never be usable and the readahead fails immediately, so we might as well prevent the directory from being opened at all. This prevents a subsequent read or modify operation from hitting it and taking the fs offline. NOTE: We're only checking for early failures in the block mapping, not the readahead directory block itself. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2017-06-02xfs: fix toctou race when locking an inode to access the data mapDarrick J. Wong
commit 4b5bd5bf3fb182dc504b1b64e0331300f156e756 upstream. We use di_format and if_flags to decide whether we're grabbing the ilock in btree mode (btree extents not loaded) or shared mode (anything else), but the state of those fields can be changed by other threads that are also trying to load the btree extents -- IFEXTENTS gets set before the _bmap_read_extents call and cleared if it fails. We don't actually need to have IFEXTENTS set until after the bmbt records are successfully loaded and validated, which will fix the race between multiple threads trying to read the same directory. The next patch strengthens directory bmbt validation by refusing to open the directory if reading the bmbt to start directory readahead fails. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2017-06-02xfs: fix eofblocks race with file extending async dio writesBrian Foster
commit e4229d6b0bc9280f29624faf170cf76a9f1ca60e upstream. It's possible for post-eof blocks to end up being used for direct I/O writes. dio write performs an upfront unwritten extent allocation, sends the dio and then updates the inode size (if necessary) on write completion. If a file release occurs while a file extending dio write is in flight, it is possible to mistake the post-eof blocks for speculative preallocation and incorrectly truncate them from the inode. This means that the resulting dio write completion can discover a hole and allocate new blocks rather than perform unwritten extent conversion. This requires a strange mix of I/O and is thus not likely to reproduce in real world workloads. It is intermittently reproduced by generic/299. The error manifests as an assert failure due to transaction overrun because the aforementioned write completion transaction has only reserved enough blocks for btree operations: XFS: Assertion failed: tp->t_blk_res_used <= tp->t_blk_res, \ file: fs/xfs//xfs_trans.c, line: 309 The root cause is that xfs_free_eofblocks() uses i_size to truncate post-eof blocks from the inode, but async, file extending direct writes do not update i_size until write completion, long after inode locks are dropped. Therefore, xfs_free_eofblocks() effectively truncates the inode to the incorrect size. Update xfs_free_eofblocks() to serialize against dio similar to how extending writes are serialized against i_size updates before post-eof block zeroing. Specifically, wait on dio while under the iolock. This ensures that dio write completions have updated i_size before post-eof blocks are processed. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2017-06-02xfs: pull up iolock from xfs_free_eofblocks()Brian Foster
commit a36b926180cda375ac2ec89e1748b47137cfc51c upstream. xfs_free_eofblocks() requires the IOLOCK_EXCL lock, but is called from different contexts where the lock may or may not be held. The need_iolock parameter exists for this reason, to indicate whether xfs_free_eofblocks() must acquire the iolock itself before it can proceed. This is ugly and confusing. Simplify the semantics of xfs_free_eofblocks() to require the caller to acquire the iolock appropriately and kill the need_iolock parameter. While here, the mp param can be removed as well as the xfs_mount is accessible from the xfs_inode structure. This patch does not change behavior. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>