summaryrefslogtreecommitdiffstats
path: root/include
AgeCommit message (Collapse)Author
2015-01-08userns: Add a knob to disable setgroups on a per user namespace basisEric W. Biederman
commit 9cc46516ddf497ea16e8d7cb986ae03a0f6b92f8 upstream. - Expose the knob to user space through a proc file /proc/<pid>/setgroups A value of "deny" means the setgroups system call is disabled in the current processes user namespace and can not be enabled in the future in this user namespace. A value of "allow" means the segtoups system call is enabled. - Descendant user namespaces inherit the value of setgroups from their parents. - A proc file is used (instead of a sysctl) as sysctls currently do not allow checking the permissions at open time. - Writing to the proc file is restricted to before the gid_map for the user namespace is set. This ensures that disabling setgroups at a user namespace level will never remove the ability to call setgroups from a process that already has that ability. A process may opt in to the setgroups disable for itself by creating, entering and configuring a user namespace or by calling setns on an existing user namespace with setgroups disabled. Processes without privileges already can not call setgroups so this is a noop. Prodcess with privilege become processes without privilege when entering a user namespace and as with any other path to dropping privilege they would not have the ability to call setgroups. So this remains within the bounds of what is possible without a knob to disable setgroups permanently in a user namespace. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-08userns: Don't allow setgroups until a gid mapping has been setablishedEric W. Biederman
commit 273d2c67c3e179adb1e74f403d1e9a06e3f841b5 upstream. setgroups is unique in not needing a valid mapping before it can be called, in the case of setgroups(0, NULL) which drops all supplemental groups. The design of the user namespace assumes that CAP_SETGID can not actually be used until a gid mapping is established. Therefore add a helper function to see if the user namespace gid mapping has been established and call that function in the setgroups permission check. This is part of the fix for CVE-2014-8989, being able to drop groups without privilege using user namespaces. Reviewed-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-08groups: Consolidate the setgroups permission checksEric W. Biederman
commit 7ff4d90b4c24a03666f296c3d4878cd39001e81e upstream. Today there are 3 instances of setgroups and due to an oversight their permission checking has diverged. Add a common function so that they may all share the same permission checking code. This corrects the current oversight in the current permission checks and adds a helper to avoid this in the future. A user namespace security fix will update this new helper, shortly. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-12-06bitops: Fix shift overflow in GENMASK macrosMaxime COQUELIN
commit 00b4d9a14125f1e51874def2b9de6092e007412d upstream. On some 32 bits architectures, including x86, GENMASK(31, 0) returns 0 instead of the expected ~0UL. This is the same on some 64 bits architectures with GENMASK_ULL(63, 0). This is due to an overflow in the shift operand, 1 << 32 for GENMASK, 1 << 64 for GENMASK_ULL. Reported-by: Eric Paire <eric.paire@st.com> Suggested-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Signed-off-by: Maxime Coquelin <maxime.coquelin@st.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: linux@rasmusvillemoes.dk Cc: gong.chen@linux.intel.com Cc: John Sullivan <jsrhbz@kanargh.force9.co.uk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Theodore Ts'o <tytso@mit.edu> Fixes: 10ef6b0dffe4 ("bitops: Introduce a more generic BITMASK macro") Link: http://lkml.kernel.org/r/1415267659-10563-1-git-send-email-maxime.coquelin@st.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-12-06iio: Fix IIO_EVENT_CODE_EXTRACT_DIR bit maskCristina Ciocan
commit ccf54555da9a5e91e454b909ca6a5303c7d6b910 upstream. The direction field is set on 7 bits, thus we need to AND it with 0111 111 mask in order to retrieve it, that is 0x7F, not 0xCF as it is now. Fixes: ade7ef7ba (staging:iio: Differential channel handling) Signed-off-by: Cristina Ciocan <cristina.ciocan@intel.com> Signed-off-by: Jonathan Cameron <jic23@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-12-06ASoC: dpcm: Fix race between FE/BE updates and triggerTakashi Iwai
commit ea9d0d771fcd32cd56070819749477d511ec9117 upstream. DPCM can update the FE/BE connection states totally asynchronously from the FE's PCM state. Most of FE/BE state changes are protected by mutex, so that they won't race, but there are still some actions that are uncovered. For example, suppose to switch a BE while a FE's stream is running. This would call soc_dpcm_runtime_update(), which sets FE's runtime_update flag, then sets up and starts BEs, and clears FE's runtime_update flag again. When a device emits XRUN during this operation, the PCM core triggers snd_pcm_stop(XRUN). Since the trigger action is an atomic ops, this isn't blocked by the mutex, thus it kicks off DPCM's trigger action. It eventually updates and clears FE's runtime_update flag while soc_dpcm_runtime_update() is running concurrently, and it results in confusion. Usually, for avoiding such a race, we take a lock. There is a PCM stream lock for that purpose. However, as already mentioned, the trigger action is atomic, and we can't take the lock for the whole soc_dpcm_runtime_update() or other operations that include the lengthy jobs like hw_params or prepare. This patch provides an alternative solution. This adds a way to defer the conflicting trigger callback to be executed at the end of FE/BE state changes. For doing it, two things are introduced: - Each runtime_update state change of FEs is protected via PCM stream lock. - The FE's trigger callback checks the runtime_update flag. If it's not set, the trigger action is executed there. If set, mark the pending trigger action and returns immediately. - At the exit of runtime_update state change, it checks whether the pending trigger is present. If yes, it executes the trigger action at this point. Reported-and-tested-by: Qiao Zhou <zhouqiao@marvell.com> Signed-off-by: Takashi Iwai <tiwai@suse.de> Acked-by: Liam Girdwood <liam.r.girdwood@linux.intel.com> Signed-off-by: Mark Brown <broonie@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-12-06PCI/MSI: Add device flag indicating that 64-bit MSIs don't workBenjamin Herrenschmidt
commit f144d1496b47e7450f41b767d0d91c724c2198bc upstream. This can be set by quirks/drivers to be used by the architecture code that assigns the MSI addresses. We additionally add verification in the core MSI code that the values assigned by the architecture do satisfy the limitation in order to fail gracefully if they don't (ie. the arch hasn't been updated to deal with that quirk yet). Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Acked-by: Bjorn Helgaas <bhelgaas@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-12-06inetdevice: fixed signed integer overflowVincent BENAYOUN
[ Upstream commit 84bc88688e3f6ef843aa8803dbcd90168bb89faf ] There could be a signed overflow in the following code. The expression, (32-logmask) is comprised between 0 and 31 included. It may be equal to 31. In such a case the left shift will produce a signed integer overflow. According to the C99 Standard, this is an undefined behavior. A simple fix is to replace the signed int 1 with the unsigned int 1U. Signed-off-by: Vincent BENAYOUN <vincent.benayoun@trust-in-soft.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-21mm/compaction: do not count migratepages when unnecessaryVlastimil Babka
commit f8c9301fa5a2a8b873c67f2a3d8230d5c13f61b7 upstream. During compaction, update_nr_listpages() has been used to count remaining non-migrated and free pages after a call to migrage_pages(). The freepages counting has become unneccessary, and it turns out that migratepages counting is also unnecessary in most cases. The only situation when it's needed to count cc->migratepages is when migrate_pages() returns with a negative error code. Otherwise, the non-negative return value is the number of pages that were not migrated, which is exactly the count of remaining pages in the cc->migratepages list. Furthermore, any non-zero count is only interesting for the tracepoint of mm_compaction_migratepages events, because after that all remaining unmigrated pages are put back and their count is set to 0. This patch therefore removes update_nr_listpages() completely, and changes the tracepoint definition so that the manual counting is done only when the tracepoint is enabled, and only when migrate_pages() returns a negative error code. Furthermore, migrate_pages() and the tracepoints won't be called when there's nothing to migrate. This potentially avoids some wasted cycles and reduces the volume of uninteresting mm_compaction_migratepages events where "nr_migrated=0 nr_failed=0". In the stress-highalloc mmtest, this was about 75% of the events. The mm_compaction_isolate_migratepages event is better for determining that nothing was isolated for migration, and this one was just duplicating the info. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Acked-by: Michal Nazarewicz <mina86@mina86.com> Cc: Christoph Lameter <cl@linux.com> Cc: Rik van Riel <riel@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-21mm, compaction: embed migration mode in compact_controlDavid Rientjes
commit e0b9daeb453e602a95ea43853dc12d385558ce1f upstream. We're going to want to manipulate the migration mode for compaction in the page allocator, and currently compact_control's sync field is only a bool. Currently, we only do MIGRATE_ASYNC or MIGRATE_SYNC_LIGHT compaction depending on the value of this bool. Convert the bool to enum migrate_mode and pass the migration mode in directly. Later, we'll want to avoid MIGRATE_SYNC_LIGHT for thp allocations in the pagefault patch to avoid unnecessary latency. This also alters compaction triggered from sysfs, either for the entire system or for a node, to force MIGRATE_SYNC. [akpm@linux-foundation.org: fix build] [iamjoonsoo.kim@lge.com: use MIGRATE_SYNC in alloc_contig_range()] Signed-off-by: David Rientjes <rientjes@google.com> Suggested-by: Mel Gorman <mgorman@suse.de> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Greg Thelen <gthelen@google.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-21mm, compaction: add per-zone migration pfn cache for async compactionDavid Rientjes
commit 35979ef3393110ff3c12c6b94552208d3bdf1a36 upstream. Each zone has a cached migration scanner pfn for memory compaction so that subsequent calls to memory compaction can start where the previous call left off. Currently, the compaction migration scanner only updates the per-zone cached pfn when pageblocks were not skipped for async compaction. This creates a dependency on calling sync compaction to avoid having subsequent calls to async compaction from scanning an enormous amount of non-MOVABLE pageblocks each time it is called. On large machines, this could be potentially very expensive. This patch adds a per-zone cached migration scanner pfn only for async compaction. It is updated everytime a pageblock has been scanned in its entirety and when no pages from it were successfully isolated. The cached migration scanner pfn for sync compaction is updated only when called for sync compaction. Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Greg Thelen <gthelen@google.com> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-21mm, migration: add destination page freeing callbackDavid Rientjes
commit 68711a746345c44ae00c64d8dbac6a9ce13ac54a upstream. Memory migration uses a callback defined by the caller to determine how to allocate destination pages. When migration fails for a source page, however, it frees the destination page back to the system. This patch adds a memory migration callback defined by the caller to determine how to free destination pages. If a caller, such as memory compaction, builds its own freelist for migration targets, this can reuse already freed memory instead of scanning additional memory. If the caller provides a function to handle freeing of destination pages, it is called when page migration fails. If the caller passes NULL then freeing back to the system will be handled as usual. This patch introduces no functional change. Signed-off-by: David Rientjes <rientjes@google.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Greg Thelen <gthelen@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-21mm/readahead.c: inline ra_submitFabian Frederick
commit 29f175d125f0f3a9503af8a5596f93d714cceb08 upstream. Commit f9acc8c7b35a ("readahead: sanify file_ra_state names") left ra_submit with a single function call. Move ra_submit to internal.h and inline it to save some stack. Thanks to Andrew Morton for commenting different versions. Signed-off-by: Fabian Frederick <fabf@skynet.be> Suggested-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-21mm: remove read_cache_page_async()Sasha Levin
commit 67f9fd91f93c582b7de2ab9325b6e179db77e4d5 upstream. This patch removes read_cache_page_async() which wasn't really needed anywhere and simplifies the code around it a bit. read_cache_page_async() is useful when we want to read a page into the cache without waiting for it to complete. This happens when the appropriate callback 'filler' doesn't complete its read operation and releases the page lock immediately, and instead queues a different completion routine to do that. This never actually happened anywhere in the code. read_cache_page_async() had 3 different callers: - read_cache_page() which is the sync version, it would just wait for the requested read to complete using wait_on_page_read(). - JFFS2 would call it from jffs2_gc_fetch_page(), but the filler function it supplied doesn't do any async reads, and would complete before the filler function returns - making it actually a sync read. - CRAMFS would call it using the read_mapping_page_async() wrapper, with a similar story to JFFS2 - the filler function doesn't do anything that reminds async reads and would always complete before the filler function returns. To sum it up, the code in mm/filemap.c never took advantage of having read_cache_page_async(). While there are filler callbacks that do async reads (such as the block one), we always called it with the read_cache_page(). This patch adds a mandatory wait for read to complete when adding a new page to the cache, and removes read_cache_page_async() and its wrappers. Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-21mm + fs: prepare for non-page entries in page cache radix treesJohannes Weiner
commit 0cd6144aadd2afd19d1aca880153530c52957604 upstream. shmem mappings already contain exceptional entries where swap slot information is remembered. To be able to store eviction information for regular page cache, prepare every site dealing with the radix trees directly to handle entries other than pages. The common lookup functions will filter out non-page entries and return NULL for page cache holes, just as before. But provide a raw version of the API which returns non-page entries as well, and switch shmem over to use it. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Bob Liu <bob.liu@oracle.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-21mm: filemap: move radix tree hole searching hereJohannes Weiner
commit e7b563bb2a6f4d974208da46200784b9c5b5a47e upstream. The radix tree hole searching code is only used for page cache, for example the readahead code trying to get a a picture of the area surrounding a fault. It sufficed to rely on the radix tree definition of holes, which is "empty tree slot". But this is about to change, though, as shadow page descriptors will be stored in the page cache after the actual pages get evicted from memory. Move the functions over to mm/filemap.c and make them native page cache operations, where they can later be adapted to handle the new definition of "page cache hole". Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Bob Liu <bob.liu@oracle.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-21lib: radix-tree: add radix_tree_delete_item()Johannes Weiner
commit 53c59f262d747ea82e7414774c59a489501186a0 upstream. Provide a function that does not just delete an entry at a given index, but also allows passing in an expected item. Delete only if that item is still located at the specified index. This is handy when lockless tree traversals want to delete entries as well because they don't have to do an second, locked lookup to verify the slot has not changed under them before deleting the entry. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Rik van Riel <riel@redhat.com> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Bob Liu <bob.liu@oracle.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-21net: sctp: fix skb_over_panic when receiving malformed ASCONF chunksDaniel Borkmann
commit 9de7922bc709eee2f609cd01d98aaedc4cf5ea74 upstream. Commit 6f4c618ddb0 ("SCTP : Add paramters validity check for ASCONF chunk") added basic verification of ASCONF chunks, however, it is still possible to remotely crash a server by sending a special crafted ASCONF chunk, even up to pre 2.6.12 kernels: skb_over_panic: text:ffffffffa01ea1c3 len:31056 put:30768 head:ffff88011bd81800 data:ffff88011bd81800 tail:0x7950 end:0x440 dev:<NULL> ------------[ cut here ]------------ kernel BUG at net/core/skbuff.c:129! [...] Call Trace: <IRQ> [<ffffffff8144fb1c>] skb_put+0x5c/0x70 [<ffffffffa01ea1c3>] sctp_addto_chunk+0x63/0xd0 [sctp] [<ffffffffa01eadaf>] sctp_process_asconf+0x1af/0x540 [sctp] [<ffffffff8152d025>] ? _read_unlock_bh+0x15/0x20 [<ffffffffa01e0038>] sctp_sf_do_asconf+0x168/0x240 [sctp] [<ffffffffa01e3751>] sctp_do_sm+0x71/0x1210 [sctp] [<ffffffff8147645d>] ? fib_rules_lookup+0xad/0xf0 [<ffffffffa01e6b22>] ? sctp_cmp_addr_exact+0x32/0x40 [sctp] [<ffffffffa01e8393>] sctp_assoc_bh_rcv+0xd3/0x180 [sctp] [<ffffffffa01ee986>] sctp_inq_push+0x56/0x80 [sctp] [<ffffffffa01fcc42>] sctp_rcv+0x982/0xa10 [sctp] [<ffffffffa01d5123>] ? ipt_local_in_hook+0x23/0x28 [iptable_filter] [<ffffffff8148bdc9>] ? nf_iterate+0x69/0xb0 [<ffffffff81496d10>] ? ip_local_deliver_finish+0x0/0x2d0 [<ffffffff8148bf86>] ? nf_hook_slow+0x76/0x120 [<ffffffff81496d10>] ? ip_local_deliver_finish+0x0/0x2d0 [<ffffffff81496ded>] ip_local_deliver_finish+0xdd/0x2d0 [<ffffffff81497078>] ip_local_deliver+0x98/0xa0 [<ffffffff8149653d>] ip_rcv_finish+0x12d/0x440 [<ffffffff81496ac5>] ip_rcv+0x275/0x350 [<ffffffff8145c88b>] __netif_receive_skb+0x4ab/0x750 [<ffffffff81460588>] netif_receive_skb+0x58/0x60 This can be triggered e.g., through a simple scripted nmap connection scan injecting the chunk after the handshake, for example, ... -------------- INIT[ASCONF; ASCONF_ACK] -------------> <----------- INIT-ACK[ASCONF; ASCONF_ACK] ------------ -------------------- COOKIE-ECHO --------------------> <-------------------- COOKIE-ACK --------------------- ------------------ ASCONF; UNKNOWN ------------------> ... where ASCONF chunk of length 280 contains 2 parameters ... 1) Add IP address parameter (param length: 16) 2) Add/del IP address parameter (param length: 255) ... followed by an UNKNOWN chunk of e.g. 4 bytes. Here, the Address Parameter in the ASCONF chunk is even missing, too. This is just an example and similarly-crafted ASCONF chunks could be used just as well. The ASCONF chunk passes through sctp_verify_asconf() as all parameters passed sanity checks, and after walking, we ended up successfully at the chunk end boundary, and thus may invoke sctp_process_asconf(). Parameter walking is done with WORD_ROUND() to take padding into account. In sctp_process_asconf()'s TLV processing, we may fail in sctp_process_asconf_param() e.g., due to removal of the IP address that is also the source address of the packet containing the ASCONF chunk, and thus we need to add all TLVs after the failure to our ASCONF response to remote via helper function sctp_add_asconf_response(), which basically invokes a sctp_addto_chunk() adding the error parameters to the given skb. When walking to the next parameter this time, we proceed with ... length = ntohs(asconf_param->param_hdr.length); asconf_param = (void *)asconf_param + length; ... instead of the WORD_ROUND()'ed length, thus resulting here in an off-by-one that leads to reading the follow-up garbage parameter length of 12336, and thus throwing an skb_over_panic for the reply when trying to sctp_addto_chunk() next time, which implicitly calls the skb_put() with that length. Fix it by using sctp_walk_params() [ which is also used in INIT parameter processing ] macro in the verification *and* in ASCONF processing: it will make sure we don't spill over, that we walk parameters WORD_ROUND()'ed. Moreover, we're being more defensive and guard against unknown parameter types and missized addresses. Joint work with Vlad Yasevich. Fixes: b896b82be4ae ("[SCTP] ADDIP: Support for processing incoming ASCONF_ACK chunks.") Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: Vlad Yasevich <vyasevich@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net> Cc: Josh Boyer <jwboyer@fedoraproject.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-21net: sctp: fix panic on duplicate ASCONF chunksDaniel Borkmann
commit b69040d8e39f20d5215a03502a8e8b4c6ab78395 upstream. When receiving a e.g. semi-good formed connection scan in the form of ... -------------- INIT[ASCONF; ASCONF_ACK] -------------> <----------- INIT-ACK[ASCONF; ASCONF_ACK] ------------ -------------------- COOKIE-ECHO --------------------> <-------------------- COOKIE-ACK --------------------- ---------------- ASCONF_a; ASCONF_b -----------------> ... where ASCONF_a equals ASCONF_b chunk (at least both serials need to be equal), we panic an SCTP server! The problem is that good-formed ASCONF chunks that we reply with ASCONF_ACK chunks are cached per serial. Thus, when we receive a same ASCONF chunk twice (e.g. through a lost ASCONF_ACK), we do not need to process them again on the server side (that was the idea, also proposed in the RFC). Instead, we know it was cached and we just resend the cached chunk instead. So far, so good. Where things get nasty is in SCTP's side effect interpreter, that is, sctp_cmd_interpreter(): While incoming ASCONF_a (chunk = event_arg) is being marked !end_of_packet and !singleton, and we have an association context, we do not flush the outqueue the first time after processing the ASCONF_ACK singleton chunk via SCTP_CMD_REPLY. Instead, we keep it queued up, although we set local_cork to 1. Commit 2e3216cd54b1 changed the precedence, so that as long as we get bundled, incoming chunks we try possible bundling on outgoing queue as well. Before this commit, we would just flush the output queue. Now, while ASCONF_a's ASCONF_ACK sits in the corked outq, we continue to process the same ASCONF_b chunk from the packet. As we have cached the previous ASCONF_ACK, we find it, grab it and do another SCTP_CMD_REPLY command on it. So, effectively, we rip the chunk->list pointers and requeue the same ASCONF_ACK chunk another time. Since we process ASCONF_b, it's correctly marked with end_of_packet and we enforce an uncork, and thus flush, thus crashing the kernel. Fix it by testing if the ASCONF_ACK is currently pending and if that is the case, do not requeue it. When flushing the output queue we may relink the chunk for preparing an outgoing packet, but eventually unlink it when it's copied into the skb right before transmission. Joint work with Vlad Yasevich. Fixes: 2e3216cd54b1 ("sctp: Follow security requirement of responding with 1 packet") Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: Vlad Yasevich <vyasevich@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Cc: Josh Boyer <jwboyer@fedoraproject.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-21netfilter: xt_bpf: add mising opaque struct sk_filter definitionPablo Neira
commit e10038a8ec06ac819b7552bb67aaa6d2d6f850c1 upstream. This structure is not exposed to userspace, so fix this by defining struct sk_filter; so we skip the casting in kernelspace. This is safe since userspace has no way to lurk with that internal pointer. Fixes: e6f30c7 ("netfilter: x_tables: add xt_bpf match") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-21clocksource: Remove "weak" from clocksource_default_clock() declarationBjorn Helgaas
commit 96a2adbc6f501996418da9f7afe39bf0e4d006a9 upstream. kernel/time/jiffies.c provides a default clocksource_default_clock() definition explicitly marked "weak". arch/s390 provides its own definition intended to override the default, but the "weak" attribute on the declaration applied to the s390 definition as well, so the linker chose one based on link order (see 10629d711ed7 ("PCI: Remove __weak annotation from pcibios_get_phb_of_node decl")). Remove the "weak" attribute from the clocksource_default_clock() declaration so we always prefer a non-weak definition over the weak one, independent of link order. Fixes: f1b82746c1e9 ("clocksource: Cleanup clocksource selection") Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Acked-by: John Stultz <john.stultz@linaro.org> Acked-by: Ingo Molnar <mingo@kernel.org> CC: Daniel Lezcano <daniel.lezcano@linaro.org> CC: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-21kgdb: Remove "weak" from kgdb_arch_pc() declarationBjorn Helgaas
commit 107bcc6d566cb40184068d888637f9aefe6252dd upstream. kernel/debug/debug_core.c provides a default kgdb_arch_pc() definition explicitly marked "weak". Several architectures provide their own definitions intended to override the default, but the "weak" attribute on the declaration applied to the arch definitions as well, so the linker chose one based on link order (see 10629d711ed7 ("PCI: Remove __weak annotation from pcibios_get_phb_of_node decl")). Remove the "weak" attribute from the declaration so we always prefer a non-weak definition over the weak one, independent of link order. Fixes: 688b744d8bc8 ("kgdb: fix signedness mixmatches, add statics, add declaration to header") Tested-by: Vineet Gupta <vgupta@synopsys.com> # for ARC build Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Harvey Harrison <harvey.harrison@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-21vmcore: Remove "weak" from function declarationsBjorn Helgaas
commit 5ab03ac5aaa1f032e071f1b3dc433b7839359c03 upstream. For the following functions: elfcorehdr_alloc() elfcorehdr_free() elfcorehdr_read() elfcorehdr_read_notes() remap_oldmem_pfn_range() fs/proc/vmcore.c provides default definitions explicitly marked "weak". arch/s390 provides its own definitions intended to override the default ones, but the "weak" attribute on the declarations applied to the s390 definitions as well, so the linker chose one based on link order (see 10629d711ed7 ("PCI: Remove __weak annotation from pcibios_get_phb_of_node decl")). Remove the "weak" attribute from the declarations so we always prefer a non-weak definition over the weak one, independent of link order. Fixes: be8a8d069e50 ("vmcore: introduce ELF header in new memory feature") Fixes: 9cb218131de1 ("vmcore: introduce remap_oldmem_pfn_range()") Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Acked-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Vivek Goyal <vgoyal@redhat.com> CC: Michael Holzheu <holzheu@linux.vnet.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-21memory-hotplug: Remove "weak" from memory_block_size_bytes() declarationBjorn Helgaas
commit e0a8400c6923a163265d52798cdd4c33f3f8ab5a upstream. drivers/base/memory.c provides a default memory_block_size_bytes() definition explicitly marked "weak". Several architectures provide their own definitions intended to override the default, but the "weak" attribute on the declaration applied to the arch definitions as well, so the linker chose one based on link order (see 10629d711ed7 ("PCI: Remove __weak annotation from pcibios_get_phb_of_node decl")). Remove the "weak" attribute from the declaration so we always prefer a non-weak definition over the weak one, independent of link order. Fixes: 41f107266b19 ("drivers: base: Add prototype declaration to the header file") Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Acked-by: Andrew Morton <akpm@linux-foundation.org> CC: Rashika Kheria <rashika.kheria@gmail.com> CC: Nathan Fontenot <nfont@austin.ibm.com> CC: Anton Blanchard <anton@au1.ibm.com> CC: Heiko Carstens <heiko.carstens@de.ibm.com> CC: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-21power: charger-manager: Fix accessing invalidated power supply after charger ↵Krzysztof Kozlowski
unbind commit cdaf3e15385d3232b52287e50692506f8fd01a09 upstream. The charger manager obtained in probe references to power supplies for all chargers with power_supply_get_by_name() for later usage. However if such charger driver was removed then this reference would point to old power supply (from driver which was removed). This lead to accessing invalid memory which could be observed with: $ echo "max77693-charger" > /sys/bus/platform/drivers/max77693-charger/unbind $ grep . /sys/devices/virtual/power_supply/battery/charger.0/* $ grep . /sys/devices/virtual/power_supply/battery/* [ 15.339817] Unable to handle kernel paging request at virtual address 0001c12c [ 15.346187] pgd = edd08000 [ 15.348814] [0001c12c] *pgd=6dce2831, *pte=00000000, *ppte=00000000 [ 15.355075] Internal error: Oops: 80000007 [#1] PREEMPT SMP ARM [ 15.360967] Modules linked in: [ 15.364010] CPU: 2 PID: 1388 Comm: grep Not tainted 3.17.0-next-20141007-00027-ga95e761db1b0 #245 [ 15.372859] task: ee03ad00 ti: edcf6000 task.ti: edcf6000 [ 15.378241] PC is at 0x1c12c [ 15.381113] LR is at is_ext_pwr_online+0x30/0x6c [ 15.385706] pc : [<0001c12c>] lr : [<c0339fc4>] psr: a0000013 [ 15.385706] sp : edcf7e88 ip : 00000000 fp : 00000000 [ 15.397161] r10: eeb02c08 r9 : c04b1f84 r8 : eeb02c00 [ 15.402369] r7 : edc69a10 r6 : eea6ac10 r5 : eea6ac10 r4 : 00000004 [ 15.408878] r3 : 0001c12c r2 : edcf7e8c r1 : 00000004 r0 : ee914418 [ 15.415390] Flags: NzCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment user [ 15.422506] Control: 10c5387d Table: 6dd0804a DAC: 00000015 [ 15.428236] Process grep (pid: 1388, stack limit = 0xedcf6240) [ 15.434050] Stack: (0xedcf7e88 to 0xedcf8000) [ 15.438395] 7e80: ee03ad00 00000000 edcf7f80 eea6aca8 edcf7ec4 c033b7b0 [ 15.446554] 7ea0: 00000001 ee1cc3f0 00000004 c06e1e44 eebdc000 c06e1e44 eeb02c00 c0337144 [ 15.454713] 7ec0: ee2dac68 c005cffc ee1cc3c0 c06e1e44 00000fff 00001000 eebdc000 c0278ca8 [ 15.462872] 7ee0: c0278c8c ee1cc3c0 eeb7ce00 c014422c edcf7f20 00008000 ee1cc3c0 ee9a48c0 [ 15.471030] 7f00: 00000001 00000001 edcf7f80 c0142d94 c0142d70 c01060f4 00021000 ee1cc3f0 [ 15.479190] 7f20: 00000000 00000000 c06a2150 eebdc000 2e7ec000 ee9a48c0 00008000 00021000 [ 15.487349] 7f40: edcf7f80 00008000 edcf6000 00021000 00021000 c00e39a4 00000000 ee9a48c0 [ 15.495508] 7f60: 00004000 00000000 00000000 ee9a48c0 ee9a48c0 00008000 00021000 c00e3aa0 [ 15.503668] 7f80: 00000000 00000000 0001f2e0 0001f2e0 00021000 00001000 00000003 c000f364 [ 15.511826] 7fa0: 00000000 c000f1a0 0001f2e0 00021000 00000003 00021000 00008000 00000000 [ 15.519986] 7fc0: 0001f2e0 00021000 00001000 00000003 00000001 000205e8 00000000 00021000 [ 15.528145] 7fe0: 00008000 bebbe910 0000a7ad b6edc49c 60000010 00000003 aaaaaaaa aaaaaaaa [ 15.536320] [<c0339fc4>] (is_ext_pwr_online) from [<c033b7b0>] (charger_get_property+0x170/0x314) [ 15.545164] [<c033b7b0>] (charger_get_property) from [<c0337144>] (power_supply_show_property+0x48/0x20c) [ 15.554719] [<c0337144>] (power_supply_show_property) from [<c0278ca8>] (dev_attr_show+0x1c/0x48) [ 15.563577] [<c0278ca8>] (dev_attr_show) from [<c014422c>] (sysfs_kf_seq_show+0x84/0x104) [ 15.571725] [<c014422c>] (sysfs_kf_seq_show) from [<c0142d94>] (kernfs_seq_show+0x24/0x28) [ 15.579973] [<c0142d94>] (kernfs_seq_show) from [<c01060f4>] (seq_read+0x1b0/0x484) [ 15.587614] [<c01060f4>] (seq_read) from [<c00e39a4>] (vfs_read+0x88/0x144) [ 15.594552] [<c00e39a4>] (vfs_read) from [<c00e3aa0>] (SyS_read+0x40/0x8c) [ 15.601417] [<c00e3aa0>] (SyS_read) from [<c000f1a0>] (ret_fast_syscall+0x0/0x48) [ 15.608877] Code: bad PC value [ 15.611991] ---[ end trace a88fcc95208db283 ]--- The charger-manager should get reference to charger power supply on each use of get_property callback. Signed-off-by: Krzysztof Kozlowski <k.kozlowski@samsung.com> Fixes: 3bb3dbbd56ea ("power_supply: Add initial Charger-Manager driver") Signed-off-by: Sebastian Reichel <sre@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-21power: charger-manager: Fix accessing invalidated power supply after fuel ↵Krzysztof Kozlowski
gauge unbind commit bdbe81445407644492b9ac69a24d35e3202d773b upstream. The charger manager obtained reference to fuel gauge power supply in probe with power_supply_get_by_name() for later usage. However if fuel gauge driver was removed and re-added then this reference would point to old power supply (from driver which was removed). This lead to accessing old (and probably invalid) memory which could be observed with: $ echo "12-0036" > /sys/bus/i2c/drivers/max17042/unbind $ echo "12-0036" > /sys/bus/i2c/drivers/max17042/bind $ cat /sys/devices/virtual/power_supply/battery/capacity [ 240.480084] INFO: task cat:1393 blocked for more than 120 seconds. [ 240.484799] Not tainted 3.17.0-next-20141007-00028-ge60b6dd79570 #203 [ 240.491782] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 240.499589] cat D c0469530 0 1393 1 0x00000000 [ 240.505947] [<c0469530>] (__schedule) from [<c0469d3c>] (schedule_preempt_disabled+0x14/0x20) [ 240.514449] [<c0469d3c>] (schedule_preempt_disabled) from [<c046af08>] (mutex_lock_nested+0x1bc/0x458) [ 240.523736] [<c046af08>] (mutex_lock_nested) from [<c0287a98>] (regmap_read+0x30/0x60) [ 240.531647] [<c0287a98>] (regmap_read) from [<c032238c>] (max17042_get_property+0x2e8/0x350) [ 240.540055] [<c032238c>] (max17042_get_property) from [<c03247d8>] (charger_get_property+0x264/0x348) [ 240.549252] [<c03247d8>] (charger_get_property) from [<c0320764>] (power_supply_show_property+0x48/0x1e0) [ 240.558808] [<c0320764>] (power_supply_show_property) from [<c027308c>] (dev_attr_show+0x1c/0x48) [ 240.567664] [<c027308c>] (dev_attr_show) from [<c0141fb0>] (sysfs_kf_seq_show+0x84/0x104) [ 240.575814] [<c0141fb0>] (sysfs_kf_seq_show) from [<c0140b18>] (kernfs_seq_show+0x24/0x28) [ 240.584061] [<c0140b18>] (kernfs_seq_show) from [<c0104574>] (seq_read+0x1b0/0x484) [ 240.591702] [<c0104574>] (seq_read) from [<c00e1e24>] (vfs_read+0x88/0x144) [ 240.598640] [<c00e1e24>] (vfs_read) from [<c00e1f20>] (SyS_read+0x40/0x8c) [ 240.605507] [<c00e1f20>] (SyS_read) from [<c000e760>] (ret_fast_syscall+0x0/0x48) [ 240.612952] 4 locks held by cat/1393: [ 240.616589] #0: (&p->lock){+.+.+.}, at: [<c01043f4>] seq_read+0x30/0x484 [ 240.623414] #1: (&of->mutex){+.+.+.}, at: [<c01417dc>] kernfs_seq_start+0x1c/0x8c [ 240.631086] #2: (s_active#31){++++.+}, at: [<c01417e4>] kernfs_seq_start+0x24/0x8c [ 240.638777] #3: (&map->mutex){+.+...}, at: [<c0287a98>] regmap_read+0x30/0x60 The charger-manager should get reference to fuel gauge power supply on each use of get_property callback. The thermal zone 'tzd' field of power supply should not be used because of the same reason. Additionally this change solves also the issue with nested thermal_zone_get_temp() calls and related false lockdep positive for deadlock for thermal zone's mutex [1]. When fuel gauge is used as source of temperature then the charger manager forwards its get_temp calls to fuel gauge thermal zone. So actually different mutexes are used (one for charger manager thermal zone and second for fuel gauge thermal zone) but for lockdep this is one class of mutex. The recursion is removed by retrieving temperature through power supply's get_property(). In case external thermal zone is used ('cm-thermal-zone' property is present in DTS) the recursion does not exist. Charger manager simply exports POWER_SUPPLY_PROP_TEMP_AMBIENT property (instead of POWER_SUPPLY_PROP_TEMP) thus no thermal zone is created for this power supply. [1] https://lkml.org/lkml/2014/10/6/309 Signed-off-by: Krzysztof Kozlowski <k.kozlowski@samsung.com> Fixes: 3bb3dbbd56ea ("power_supply: Add initial Charger-Manager driver") Signed-off-by: Sebastian Reichel <sre@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-21nfs: fix pnfs direct write memory leakPeng Tao
commit 8c393f9a721c30a030049a680e1bf896669bb279 upstream. For pNFS direct writes, layout driver may dynamically allocate ds_cinfo.buckets. So we need to take care to free them when freeing dreq. Ideally this needs to be done inside layout driver where ds_cinfo.buckets are allocated. But buckets are attached to dreq and reused across LD IO iterations. So I feel it's OK to free them in the generic layer. Signed-off-by: Peng Tao <tao.peng@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-21pinctrl: dra: dt-bindings: Fix output pull up/downRoger Quadros
commit 73b3a6657a88ef5348a0d69c9a8107d6f01ae862 upstream. For PIN_OUTPUT_PULLUP and PIN_OUTPUT_PULLDOWN we must not set the PULL_DIS bit which disables the PULLs. PULL_ENA is a 0 and using it in an OR operation is a NOP, so don't use it in the PIN_OUTPUT_PULLUP/DOWN macros. Fixes: 23d9cec07c58 ("pinctrl: dra: dt-bindings: Fix pull enable/disable") Signed-off-by: Roger Quadros <rogerq@ti.com> Acked-by: Nishanth Menon <nm@ti.com> Signed-off-by: Tony Lindgren <tony@atomide.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14of: Fix overflow bug in string property parsing functionsGrant Likely
commit a87fa1d81a9fb5e9adca9820e16008c40ad09f33 upstream. The string property read helpers will run off the end of the buffer if it is handed a malformed string property. Rework the parsers to make sure that doesn't happen. At the same time add new test cases to make sure the functions behave themselves. The original implementations of of_property_read_string_index() and of_property_count_strings() both open-coded the same block of parsing code, each with it's own subtly different bugs. The fix here merges functions into a single helper and makes the original functions static inline wrappers around the helper. One non-bugfix aspect of this patch is the addition of a new wrapper, of_property_read_string_array(). The new wrapper is needed by the device_properties feature that Rafael is working on and planning to merge for v3.19. The implementation is identical both with and without the new static inline wrapper, so it just got left in to reduce the churn on the header file. Signed-off-by: Grant Likely <grant.likely@linaro.org> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: Mika Westerberg <mika.westerberg@linux.intel.com> Cc: Rob Herring <robh+dt@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Darren Hart <darren.hart@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14drm/radeon: remove invalid pci idAlex Deucher
commit 8c3e434769b1707fd2d24de5a2eb25fedc634c4a upstream. 0x4c6e is a secondary device id so should not be used by the driver. Noticed-by: Mark Kettenis <mark.kettenis@xs4all.nl> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14HID: usbhid: add always-poll quirkJohan Hovold
commit 0b750b3baa2d64f1b77aecc10f20deeb28efe60d upstream. Add quirk to make sure that a device is always polled for input events even if it hasn't been opened. This is needed for devices that disconnects from the bus unless the interrupt endpoint has been polled at least once or when not responding to an input event (e.g. after having shut down X). Signed-off-by: Johan Hovold <johan@kernel.org> Signed-off-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14USB: core: add device-qualifier quirkJohan Hovold
commit 2a159389bf5d962359349a76827b2f683276a1c7 upstream. Add new quirk for devices that cannot handle requests for the device_qualifier descriptor. A USB-2.0 compliant device must respond to requests for the device_qualifier descriptor (even if it's with a request error), but at least one device is known to misbehave after such a request. Suggested-by: Bjørn Mork <bjorn@mork.no> Signed-off-by: Johan Hovold <johan@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14OOM, PM: OOM killed task shouldn't escape PM suspendMichal Hocko
commit 5695be142e203167e3cb515ef86a88424f3524eb upstream. PM freezer relies on having all tasks frozen by the time devices are getting frozen so that no task will touch them while they are getting frozen. But OOM killer is allowed to kill an already frozen task in order to handle OOM situtation. In order to protect from late wake ups OOM killer is disabled after all tasks are frozen. This, however, still keeps a window open when a killed task didn't manage to die by the time freeze_processes finishes. Reduce the race window by checking all tasks after OOM killer has been disabled. This is still not race free completely unfortunately because oom_killer_disable cannot stop an already ongoing OOM killer so a task might still wake up from the fridge and get killed without freeze_processes noticing. Full synchronization of OOM and freezer is, however, too heavy weight for this highly unlikely case. Introduce and check oom_kills counter which gets incremented early when the allocator enters __alloc_pages_may_oom path and only check all the tasks if the counter changes during the freezing attempt. The counter is updated so early to reduce the race window since allocator checked oom_killer_disabled which is set by PM-freezing code. A false positive will push the PM-freezer into a slow path but that is not a big deal. Changes since v1 - push the re-check loop out of freeze_processes into check_frozen_processes and invert the condition to make the code more readable as per Rafael Fixes: f660daac474c6f (oom: thaw threads if oom killed thread is frozen before deferring) Signed-off-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14drm/vmwgfx: Fix drm.h includeJosh Boyer
commit e351943b081f4d9e6f692ce1a6117e8d2e71f478 upstream. The userspace drm.h include doesn't prefix the drm directory. This can lead to compile failures as /usr/include/drm/ isn't in the standard gcc include paths. Fix it to be <drm/drm.h>, which matches the rest of the driver drm header files that get installed into /usr/include/drm. Red Hat Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1138759 Fixes: 1d7a5cbf8f74e Reported-by: Jeffrey Bastian <jbastian@redhat.com> Signed-off-by: Josh Boyer <jwboyer@fedoraproject.org> Signed-off-by: Dave Airlie <airlied@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14block: fix alignment_offset math that assumes io_min is a power-of-2Mike Snitzer
commit b8839b8c55f3fdd60dc36abcda7e0266aff7985c upstream. The math in both blk_stack_limits() and queue_limit_alignment_offset() assume that a block device's io_min (aka minimum_io_size) is always a power-of-2. Fix the math such that it works for non-power-of-2 io_min. This issue (of alignment_offset != 0) became apparent when testing dm-thinp with a thinp blocksize that matches a RAID6 stripesize of 1280K. Commit fdfb4c8c1 ("dm thin: set minimum_io_size to pool's data block size") unlocked the potential for alignment_offset != 0 due to the dm-thin-pool's io_min possibly being a non-power-of-2. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14random: add and use memzero_explicit() for clearing dataDaniel Borkmann
commit d4c5efdb97773f59a2b711754ca0953f24516739 upstream. zatimend has reported that in his environment (3.16/gcc4.8.3/corei7) memset() calls which clear out sensitive data in extract_{buf,entropy, entropy_user}() in random driver are being optimized away by gcc. Add a helper memzero_explicit() (similarly as explicit_bzero() variants) that can be used in such cases where a variable with sensitive data is being cleared out in the end. Other use cases might also be in crypto code. [ I have put this into lib/string.c though, as it's always built-in and doesn't need any dependencies then. ] Fixes kernel bugzilla: 82041 Reported-by: zatimend@hotmail.co.uk Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14vfs: fix data corruption when blocksize < pagesize for mmaped dataJan Kara
commit 90a8020278c1598fafd071736a0846b38510309c upstream. ->page_mkwrite() is used by filesystems to allocate blocks under a page which is becoming writeably mmapped in some process' address space. This allows a filesystem to return a page fault if there is not enough space available, user exceeds quota or similar problem happens, rather than silently discarding data later when writepage is called. However VFS fails to call ->page_mkwrite() in all the cases where filesystems need it when blocksize < pagesize. For example when blocksize = 1024, pagesize = 4096 the following is problematic: ftruncate(fd, 0); pwrite(fd, buf, 1024, 0); map = mmap(NULL, 1024, PROT_WRITE, MAP_SHARED, fd, 0); map[0] = 'a'; ----> page_mkwrite() for index 0 is called ftruncate(fd, 10000); /* or even pwrite(fd, buf, 1, 10000) */ mremap(map, 1024, 10000, 0); map[4095] = 'a'; ----> no page_mkwrite() called At the moment ->page_mkwrite() is called, filesystem can allocate only one block for the page because i_size == 1024. Otherwise it would create blocks beyond i_size which is generally undesirable. But later at ->writepage() time, we also need to store data at offset 4095 but we don't have block allocated for it. This patch introduces a helper function filesystems can use to have ->page_mkwrite() called at all the necessary moments. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14SUNRPC: Don't wake tasks during connection abortBenjamin Coddington
commit a743419f420a64d442280845c0377a915b76644f upstream. When aborting a connection to preserve source ports, don't wake the task in xs_error_report. This allows tasks with RPC_TASK_SOFTCONN to succeed if the connection needs to be re-established since it preserves the task's status instead of setting it to the status of the aborting kernel_connect(). This may also avoid a potential conflict on the socket's lock. Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14drivers/net, ipv6: Select IPv6 fragment idents for virtio UFO packetsBen Hutchings
[ Upstream commit 5188cd44c55db3e92cd9e77a40b5baa7ed4340f7 ] UFO is now disabled on all drivers that work with virtio net headers, but userland may try to send UFO/IPv6 packets anyway. Instead of sending with ID=0, we should select identifiers on their behalf (as we used to). Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Fixes: 916e4cf46d02 ("ipv6: reuse ip6_frag_id from ip6_ufo_append_data") Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-10-30kernel: add support for gcc 5Sasha Levin
commit 71458cfc782eafe4b27656e078d379a34e472adf upstream. We're missing include/linux/compiler-gcc5.h which is required now because gcc branched off to v5 in trunk. Just copy the relevant bits out of include/linux/compiler-gcc4.h, no new code is added as of now. This fixes a build error when using gcc 5. Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-10-30mm: clear __GFP_FS when PF_MEMALLOC_NOIO is setJunxiao Bi
commit 934f3072c17cc8886f4c043b47eeeb1b12f8de33 upstream. commit 21caf2fc1931 ("mm: teach mm by current context info to not do I/O during memory allocation") introduces PF_MEMALLOC_NOIO flag to avoid doing I/O inside memory allocation, __GFP_IO is cleared when this flag is set, but __GFP_FS implies __GFP_IO, it should also be cleared. Or it may still run into I/O, like in superblock shrinker. And this will make the kernel run into the deadlock case described in that commit. See Dave Chinner's comment about io in superblock shrinker: Filesystem shrinkers do indeed perform IO from the superblock shrinker and have for years. Even clean inodes can require IO before they can be freed - e.g. on an orphan list, need truncation of post-eof blocks, need to wait for ordered operations to complete before it can be freed, etc. IOWs, Ext4, btrfs and XFS all can issue and/or block on arbitrary amounts of IO in the superblock shrinker context. XFS, in particular, has been doing transactions and IO from the VFS inode cache shrinker since it was first introduced.... Fix this by clearing __GFP_FS in memalloc_noio_flags(), this function has masked all the gfp_mask that will be passed into fs for the processes setting PF_MEMALLOC_NOIO in the direct reclaim path. v1 thread at: https://lkml.org/lkml/2014/9/3/32 Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Dave Chinner <david@fromorbit.com> Cc: joyce.xue <xuejiufei@huawei.com> Cc: Ming Lei <ming.lei@canonical.com> Cc: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-10-30pci_ids: Add support for Intel Quark ILBJosef Ahmad
commit bb048713bba3ead39f6112910906d9fe3f88ede7 upstream. This patch adds the PCI id for Intel Quark ILB. It will be used for GPIO and Multifunction device driver. Signed-off-by: Josef Ahmad <josef.ahmad@intel.com> Acked-by: Bjorn Helgaas <bhelgaas@google.com> Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Lee Jones <lee.jones@linaro.org> Signed-off-by: Chang Rebecca Swee Fun <rebecca.swee.fun.chang@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-10-15USB: Add device quirk for ASUS T100 Base Station keyboardLu Baolu
commit ddbe1fca0bcb87ca8c199ea873a456ca8a948567 upstream. This full-speed USB device generates spurious remote wakeup event as soon as USB_DEVICE_REMOTE_WAKEUP feature is set. As the result, Linux can't enter system suspend and S0ix power saving modes once this keyboard is used. This patch tries to introduce USB_QUIRK_IGNORE_REMOTE_WAKEUP quirk. With this quirk set, wakeup capability will be ignored during device configure. This patch could be back-ported to kernels as old as 2.6.39. Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> Acked-by: Alan Stern <stern@rowland.harvard.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-10-15sctp: handle association restarts when the socket is closed.Vlad Yasevich
[ Upstream commit bdf6fa52f01b941d4a80372d56de465bdbbd1d23 ] Currently association restarts do not take into consideration the state of the socket. When a restart happens, the current assocation simply transitions into established state. This creates a condition where a remote system, through a the restart procedure, may create a local association that is no way reachable by user. The conditions to trigger this are as follows: 1) Remote does not acknoledge some data causing data to remain outstanding. 2) Local application calls close() on the socket. Since data is still outstanding, the association is placed in SHUTDOWN_PENDING state. However, the socket is closed. 3) The remote tries to create a new association, triggering a restart on the local system. The association moves from SHUTDOWN_PENDING to ESTABLISHED. At this point, it is no longer reachable by any socket on the local system. This patch addresses the above situation by moving the newly ESTABLISHED association into SHUTDOWN-SENT state and bundling a SHUTDOWN after the COOKIE-ACK chunk. This way, the restarted associate immidiately enters the shutdown procedure and forces the termination of the unreachable association. Reported-by: David Laight <David.Laight@aculab.com> Signed-off-by: Vlad Yasevich <vyasevich@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-10-15xfrm: Generate queueing routes only from route lookup functionsSteffen Klassert
[ Upstream commit b8c203b2d2fc961bafd53b41d5396bbcdec55998 ] Currently we genarate a queueing route if we have matching policies but can not resolve the states and the sysctl xfrm_larval_drop is disabled. Here we assume that dst_output() is called to kill the queued packets. Unfortunately this assumption is not true in all cases, so it is possible that these packets leave the system unwanted. We fix this by generating queueing routes only from the route lookup functions, here we can guarantee a call to dst_output() afterwards. Fixes: a0073fe18e71 ("xfrm: Add a state resolution packet queue") Reported-by: Konstantinos Kolelis <k.kolelis@sirrix.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-10-15xfrm: Generate blackhole routes only from route lookup functionsSteffen Klassert
[ Upstream commit f92ee61982d6da15a9e49664ecd6405a15a2ee56 ] Currently we genarate a blackhole route route whenever we have matching policies but can not resolve the states. Here we assume that dst_output() is called to kill the balckholed packets. Unfortunately this assumption is not true in all cases, so it is possible that these packets leave the system unwanted. We fix this by generating blackhole routes only from the route lookup functions, here we can guarantee a call to dst_output() afterwards. Fixes: 2774c131b1d ("xfrm: Handle blackhole route creation via afinfo.") Reported-by: Konstantinos Kolelis <k.kolelis@sirrix.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-10-15tcp: fix tcp_release_cb() to dispatch via address family for mtu_reduced()Neal Cardwell
[ Upstream commit 4fab9071950c2021d846e18351e0f46a1cffd67b ] Make sure we use the correct address-family-specific function for handling MTU reductions from within tcp_release_cb(). Previously AF_INET6 sockets were incorrectly always using the IPv6 code path when sometimes they were handling IPv4 traffic and thus had an IPv4 dst. Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Diagnosed-by: Willem de Bruijn <willemb@google.com> Fixes: 563d34d057862 ("tcp: dont drop MTU reduction indications") Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-10-15tcp: don't use timestamp from repaired skb-s to calculate RTT (v2)Andrey Vagin
[ Upstream commit 9d186cac7ffb1831e9f34cb4a3a8b22abb9dd9d4 ] We don't know right timestamp for repaired skb-s. Wrong RTT estimations isn't good, because some congestion modules heavily depends on it. This patch adds the TCPCB_REPAIRED flag, which is included in TCPCB_RETRANS. Thanks to Eric for the advice how to fix this issue. This patch fixes the warning: [ 879.562947] WARNING: CPU: 0 PID: 2825 at net/ipv4/tcp_input.c:3078 tcp_ack+0x11f5/0x1380() [ 879.567253] CPU: 0 PID: 2825 Comm: socket-tcpbuf-l Not tainted 3.16.0-next-20140811 #1 [ 879.567829] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 [ 879.568177] 0000000000000000 00000000c532680c ffff880039643d00 ffffffff817aa2d2 [ 879.568776] 0000000000000000 ffff880039643d38 ffffffff8109afbd ffff880039d6ba80 [ 879.569386] ffff88003a449800 000000002983d6bd 0000000000000000 000000002983d6bc [ 879.569982] Call Trace: [ 879.570264] [<ffffffff817aa2d2>] dump_stack+0x4d/0x66 [ 879.570599] [<ffffffff8109afbd>] warn_slowpath_common+0x7d/0xa0 [ 879.570935] [<ffffffff8109b0ea>] warn_slowpath_null+0x1a/0x20 [ 879.571292] [<ffffffff816d0a05>] tcp_ack+0x11f5/0x1380 [ 879.571614] [<ffffffff816d10bd>] tcp_rcv_established+0x1ed/0x710 [ 879.571958] [<ffffffff816dc9da>] tcp_v4_do_rcv+0x10a/0x370 [ 879.572315] [<ffffffff81657459>] release_sock+0x89/0x1d0 [ 879.572642] [<ffffffff816c81a0>] do_tcp_setsockopt.isra.36+0x120/0x860 [ 879.573000] [<ffffffff8110a52e>] ? rcu_read_lock_held+0x6e/0x80 [ 879.573352] [<ffffffff816c8912>] tcp_setsockopt+0x32/0x40 [ 879.573678] [<ffffffff81654ac4>] sock_common_setsockopt+0x14/0x20 [ 879.574031] [<ffffffff816537b0>] SyS_setsockopt+0x80/0xf0 [ 879.574393] [<ffffffff817b40a9>] system_call_fastpath+0x16/0x1b [ 879.574730] ---[ end trace a17cbc38eb8c5c00 ]--- v2: moving setting of skb->when for repaired skb-s in tcp_write_xmit, where it's set for other skb-s. Fixes: 431a91242d8d ("tcp: timestamp SYN+DATA messages") Fixes: 740b0f1841f6 ("tcp: switch rtt estimations to usec resolution") Cc: Eric Dumazet <edumazet@google.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Andrey Vagin <avagin@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-10-15net: Always untag vlan-tagged traffic on input.Vlad Yasevich
[ Upstream commit 0d5501c1c828fb97d02af50aa9d2b1a5498b94e4 ] Currently the functionality to untag traffic on input resides as part of the vlan module and is build only when VLAN support is enabled in the kernel. When VLAN is disabled, the function vlan_untag() turns into a stub and doesn't really untag the packets. This seems to create an interesting interaction between VMs supporting checksum offloading and some network drivers. There are some drivers that do not allow the user to change tx-vlan-offload feature of the driver. These drivers also seem to assume that any VLAN-tagged traffic they transmit will have the vlan information in the vlan_tci and not in the vlan header already in the skb. When transmitting skbs that already have tagged data with partial checksum set, the checksum doesn't appear to be updated correctly by the card thus resulting in a failure to establish TCP connections. The following is a packet trace taken on the receiver where a sender is a VM with a VLAN configued. The host VM is running on doest not have VLAN support and the outging interface on the host is tg3: 10:12:43.503055 52:54:00:ae:42:3f > 28:d2:44:7d:c2:de, ethertype 802.1Q (0x8100), length 78: vlan 100, p 0, ethertype IPv4, (tos 0x0, ttl 64, id 27243, offset 0, flags [DF], proto TCP (6), length 60) 10.0.100.1.58545 > 10.0.100.10.ircu-2: Flags [S], cksum 0xdc39 (incorrect -> 0x48d9), seq 1069378582, win 29200, options [mss 1460,sackOK,TS val 4294837885 ecr 0,nop,wscale 7], length 0 10:12:44.505556 52:54:00:ae:42:3f > 28:d2:44:7d:c2:de, ethertype 802.1Q (0x8100), length 78: vlan 100, p 0, ethertype IPv4, (tos 0x0, ttl 64, id 27244, offset 0, flags [DF], proto TCP (6), length 60) 10.0.100.1.58545 > 10.0.100.10.ircu-2: Flags [S], cksum 0xdc39 (incorrect -> 0x44ee), seq 1069378582, win 29200, options [mss 1460,sackOK,TS val 4294838888 ecr 0,nop,wscale 7], length 0 This connection finally times out. I've only access to the TG3 hardware in this configuration thus have only tested this with TG3 driver. There are a lot of other drivers that do not permit user changes to vlan acceleration features, and I don't know if they all suffere from a similar issue. The patch attempt to fix this another way. It moves the vlan header stipping code out of the vlan module and always builds it into the kernel network core. This way, even if vlan is not supported on a virtualizatoin host, the virtual machines running on top of such host will still work with VLANs enabled. CC: Patrick McHardy <kaber@trash.net> CC: Nithin Nayak Sujir <nsujir@broadcom.com> CC: Michael Chan <mchan@broadcom.com> CC: Jiri Pirko <jiri@resnulli.us> Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com> Acked-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-10-09mm: per-thread vma cachingDavidlohr Bueso
commit 615d6e8756c87149f2d4c1b93d471bca002bd849 upstream. This patch is a continuation of efforts trying to optimize find_vma(), avoiding potentially expensive rbtree walks to locate a vma upon faults. The original approach (https://lkml.org/lkml/2013/11/1/410), where the largest vma was also cached, ended up being too specific and random, thus further comparison with other approaches were needed. There are two things to consider when dealing with this, the cache hit rate and the latency of find_vma(). Improving the hit-rate does not necessarily translate in finding the vma any faster, as the overhead of any fancy caching schemes can be too high to consider. We currently cache the last used vma for the whole address space, which provides a nice optimization, reducing the total cycles in find_vma() by up to 250%, for workloads with good locality. On the other hand, this simple scheme is pretty much useless for workloads with poor locality. Analyzing ebizzy runs shows that, no matter how many threads are running, the mmap_cache hit rate is less than 2%, and in many situations below 1%. The proposed approach is to replace this scheme with a small per-thread cache, maximizing hit rates at a very low maintenance cost. Invalidations are performed by simply bumping up a 32-bit sequence number. The only expensive operation is in the rare case of a seq number overflow, where all caches that share the same address space are flushed. Upon a miss, the proposed replacement policy is based on the page number that contains the virtual address in question. Concretely, the following results are seen on an 80 core, 8 socket x86-64 box: 1) System bootup: Most programs are single threaded, so the per-thread scheme does improve ~50% hit rate by just adding a few more slots to the cache. +----------------+----------+------------------+ | caching scheme | hit-rate | cycles (billion) | +----------------+----------+------------------+ | baseline | 50.61% | 19.90 | | patched | 73.45% | 13.58 | +----------------+----------+------------------+ 2) Kernel build: This one is already pretty good with the current approach as we're dealing with good locality. +----------------+----------+------------------+ | caching scheme | hit-rate | cycles (billion) | +----------------+----------+------------------+ | baseline | 75.28% | 11.03 | | patched | 88.09% | 9.31 | +----------------+----------+------------------+ 3) Oracle 11g Data Mining (4k pages): Similar to the kernel build workload. +----------------+----------+------------------+ | caching scheme | hit-rate | cycles (billion) | +----------------+----------+------------------+ | baseline | 70.66% | 17.14 | | patched | 91.15% | 12.57 | +----------------+----------+------------------+ 4) Ebizzy: There's a fair amount of variation from run to run, but this approach always shows nearly perfect hit rates, while baseline is just about non-existent. The amounts of cycles can fluctuate between anywhere from ~60 to ~116 for the baseline scheme, but this approach reduces it considerably. For instance, with 80 threads: +----------------+----------+------------------+ | caching scheme | hit-rate | cycles (billion) | +----------------+----------+------------------+ | baseline | 1.06% | 91.54 | | patched | 99.97% | 14.18 | +----------------+----------+------------------+ [akpm@linux-foundation.org: fix nommu build, per Davidlohr] [akpm@linux-foundation.org: document vmacache_valid() logic] [akpm@linux-foundation.org: attempt to untangle header files] [akpm@linux-foundation.org: add vmacache_find() BUG_ON] [hughd@google.com: add vmacache_valid_mm() (from Oleg)] [akpm@linux-foundation.org: coding-style fixes] [akpm@linux-foundation.org: adjust and enhance comments] Signed-off-by: Davidlohr Bueso <davidlohr@hp.com> Reviewed-by: Rik van Riel <riel@redhat.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Reviewed-by: Michel Lespinasse <walken@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Tested-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>