summaryrefslogtreecommitdiffstats
AgeCommit message (Collapse)Author
2016-12-10net: ping: check minimum size on ICMP header lengthKees Cook
[ Upstream commit 0eab121ef8750a5c8637d51534d5e9143fb0633f ] Prior to commit c0371da6047a ("put iov_iter into msghdr") in v3.19, there was no check that the iovec contained enough bytes for an ICMP header, and the read loop would walk across neighboring stack contents. Since the iov_iter conversion, bad arguments are noticed, but the returned error is EFAULT. Returning EINVAL is a clearer error and also solves the problem prior to v3.19. This was found using trinity with KASAN on v3.18: BUG: KASAN: stack-out-of-bounds in memcpy_fromiovec+0x60/0x114 at addr ffffffc071077da0 Read of size 8 by task trinity-c2/9623 page:ffffffbe034b9a08 count:0 mapcount:0 mapping: (null) index:0x0 flags: 0x0() page dumped because: kasan: bad access detected CPU: 0 PID: 9623 Comm: trinity-c2 Tainted: G BU 3.18.0-dirty #15 Hardware name: Google Tegra210 Smaug Rev 1,3+ (DT) Call trace: [<ffffffc000209c98>] dump_backtrace+0x0/0x1ac arch/arm64/kernel/traps.c:90 [<ffffffc000209e54>] show_stack+0x10/0x1c arch/arm64/kernel/traps.c:171 [< inline >] __dump_stack lib/dump_stack.c:15 [<ffffffc000f18dc4>] dump_stack+0x7c/0xd0 lib/dump_stack.c:50 [< inline >] print_address_description mm/kasan/report.c:147 [< inline >] kasan_report_error mm/kasan/report.c:236 [<ffffffc000373dcc>] kasan_report+0x380/0x4b8 mm/kasan/report.c:259 [< inline >] check_memory_region mm/kasan/kasan.c:264 [<ffffffc00037352c>] __asan_load8+0x20/0x70 mm/kasan/kasan.c:507 [<ffffffc0005b9624>] memcpy_fromiovec+0x5c/0x114 lib/iovec.c:15 [< inline >] memcpy_from_msg include/linux/skbuff.h:2667 [<ffffffc000ddeba0>] ping_common_sendmsg+0x50/0x108 net/ipv4/ping.c:674 [<ffffffc000dded30>] ping_v4_sendmsg+0xd8/0x698 net/ipv4/ping.c:714 [<ffffffc000dc91dc>] inet_sendmsg+0xe0/0x12c net/ipv4/af_inet.c:749 [< inline >] __sock_sendmsg_nosec net/socket.c:624 [< inline >] __sock_sendmsg net/socket.c:632 [<ffffffc000cab61c>] sock_sendmsg+0x124/0x164 net/socket.c:643 [< inline >] SYSC_sendto net/socket.c:1797 [<ffffffc000cad270>] SyS_sendto+0x178/0x1d8 net/socket.c:1761 CVE-2016-8399 Reported-by: Qidan He <i@flanker017.me> Fixes: c319b4d76b9e ("net: ipv4: add IPPROTO_ICMP socket kind") Cc: stable@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-12-10net: avoid signed overflows for SO_{SND|RCV}BUFFORCEEric Dumazet
[ Upstream commit b98b0bc8c431e3ceb4b26b0dfc8db509518fb290 ] CAP_NET_ADMIN users should not be allowed to set negative sk_sndbuf or sk_rcvbuf values, as it can lead to various memory corruptions, crashes, OOM... Note that before commit 82981930125a ("net: cleanups in sock_setsockopt()"), the bug was even more serious, since SO_SNDBUF and SO_RCVBUF were vulnerable. This needs to be backported to all known linux kernels. Again, many thanks to syzkaller team for discovering this gem. Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Andrey Konovalov <andreyknvl@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-12-10geneve: avoid use-after-free of skb->dataSabrina Dubroca
[ Upstream commit 5b01014759991887b1e450c9def01e58c02ab81b ] geneve{,6}_build_skb can end up doing a pskb_expand_head(), which makes the ip_hdr(skb) reference we stashed earlier stale. Since it's only needed as an argument to ip_tunnel_ecn_encap(), move this directly in the function call. Fixes: 08399efc6319 ("geneve: ensure ECN info is handled properly in all tx/rx paths") Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Reviewed-by: John W. Linville <linville@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-12-10tipc: check minimum bearer MTUMichal Kubeček
[ Upstream commit 3de81b758853f0b29c61e246679d20b513c4cfec ] Qian Zhang (张谦) reported a potential socket buffer overflow in tipc_msg_build() which is also known as CVE-2016-8632: due to insufficient checks, a buffer overflow can occur if MTU is too short for even tipc headers. As anyone can set device MTU in a user/net namespace, this issue can be abused by a regular user. As agreed in the discussion on Ben Hutchings' original patch, we should check the MTU at the moment a bearer is attached rather than for each processed packet. We also need to repeat the check when bearer MTU is adjusted to new device MTU. UDP case also needs a check to avoid overflow when calculating bearer MTU. Fixes: b97bf3fd8f6a ("[TIPC] Initial merge") Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Reported-by: Qian Zhang (张谦) <zhangqian-c@360.cn> Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-12-10sh_eth: remove unchecked interrupts for RZ/A1Chris Brandt
[ Upstream commit 33d446dbba4d4d6a77e1e900d434fa99e0f02c86 ] When streaming a lot of data and the RZ/A1 can't keep up, some status bits will get set that are not being checked or cleared which cause the following messages and the Ethernet driver to stop working. This patch fixes that issue. irq 21: nobody cared (try booting with the "irqpoll" option) handlers: [<c036b71c>] sh_eth_interrupt Disabling IRQ #21 Fixes: db893473d313a4ad ("sh_eth: Add support for r7s72100") Signed-off-by: Chris Brandt <chris.brandt@renesas.com> Acked-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-12-10net: bcmgenet: Utilize correct struct device for all DMA operationsFlorian Fainelli
[ Upstream commit 8c4799ac799665065f9bf1364fd71bf4f7dc6a4a ] __bcmgenet_tx_reclaim() and bcmgenet_free_rx_buffers() are not using the same struct device during unmap that was used for the map operation, which makes DMA-API debugging warn about it. Fix this by always using &priv->pdev->dev throughout the driver, using an identical device reference for all map/unmap calls. Fixes: 1c1008c793fa ("net: bcmgenet: add main driver file") Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-12-10cdc_ether: Fix handling connection notificationKristian Evensen
[ Upstream commit d5c83d0d1d83b3798c71e0c8b7c3624d39c91d88 ] Commit bfe9b9d2df66 ("cdc_ether: Improve ZTE MF823/831/910 handling") introduced a work-around in usbnet_cdc_status() for devices that exported cdc carrier on twice on connect. Before the commit, this behavior caused the link state to be incorrect. It was assumed that all CDC Ethernet devices would either export this behavior, or send one off and then one on notification (which seems to be the default behavior). Unfortunately, it turns out multiple devices sends a connection notification multiple times per second (via an interrupt), even when connection state does not change. This has been observed with several different USB LAN dongles (at least), for example 13b1:0041 (Linksys). After bfe9b9d2df66, the link state has been set as down and then up for each notification. This has caused a flood of Netlink NEWLINK messages and syslog to be flooded with messages similar to: cdc_ether 2-1:2.0 eth1: kevent 12 may have been dropped This commit fixes the behavior by reverting usbnet_cdc_status() to how it was before bfe9b9d2df66. The work-around has been moved to a separate status-function which is only called when a known, affect device is detected. v1->v2: * Do not open-code netif_carrier_ok() (thanks Henning Schild). * Call netif_carrier_off() instead of usb_link_change(). This prevents calling schedule_work() twice without giving the work queue a chance to be processed (thanks Bjørn Mork). Fixes: bfe9b9d2df66 ("cdc_ether: Improve ZTE MF823/831/910 handling") Reported-by: Henning Schild <henning.schild@siemens.com> Signed-off-by: Kristian Evensen <kristian.evensen@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-12-10ip6_offload: check segs for NULL in ipv6_gso_segment.Artem Savkov
[ Upstream commit 6b6ebb6b01c873d0cfe3449e8a1219ee6e5fc022 ] segs needs to be checked for being NULL in ipv6_gso_segment() before calling skb_shinfo(segs), otherwise kernel can run into a NULL-pointer dereference: [ 97.811262] BUG: unable to handle kernel NULL pointer dereference at 00000000000000cc [ 97.819112] IP: [<ffffffff816e52f9>] ipv6_gso_segment+0x119/0x2f0 [ 97.825214] PGD 0 [ 97.827047] [ 97.828540] Oops: 0000 [#1] SMP [ 97.831678] Modules linked in: vhost_net vhost macvtap macvlan nfsv3 rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 tun ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter bridge stp llc snd_hda_codec_realtek snd_hda_codec_hdmi snd_hda_codec_generic snd_hda_intel snd_hda_codec edac_mce_amd snd_hda_core edac_core snd_hwdep kvm_amd snd_seq kvm snd_seq_device snd_pcm irqbypass snd_timer ppdev parport_serial snd parport_pc k10temp pcspkr soundcore parport sp5100_tco shpchp sg wmi i2c_piix4 acpi_cpufreq nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sr_mod cdrom sd_mod ata_generic pata_acpi amdkfd amd_iommu_v2 radeon broadcom bcm_phy_lib i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm ahci serio_raw tg3 firewire_ohci libahci pata_atiixp drm ptp libata firewire_core pps_core i2c_core crc_itu_t fjes dm_mirror dm_region_hash dm_log dm_mod [ 97.927721] CPU: 1 PID: 3504 Comm: vhost-3495 Not tainted 4.9.0-7.el7.test.x86_64 #1 [ 97.935457] Hardware name: AMD Snook/Snook, BIOS ESK0726A 07/26/2010 [ 97.941806] task: ffff880129a1c080 task.stack: ffffc90001bcc000 [ 97.947720] RIP: 0010:[<ffffffff816e52f9>] [<ffffffff816e52f9>] ipv6_gso_segment+0x119/0x2f0 [ 97.956251] RSP: 0018:ffff88012fc43a10 EFLAGS: 00010207 [ 97.961557] RAX: 0000000000000000 RBX: ffff8801292c8700 RCX: 0000000000000594 [ 97.968687] RDX: 0000000000000593 RSI: ffff880129a846c0 RDI: 0000000000240000 [ 97.975814] RBP: ffff88012fc43a68 R08: ffff880129a8404e R09: 0000000000000000 [ 97.982942] R10: 0000000000000000 R11: ffff880129a84076 R12: 00000020002949b3 [ 97.990070] R13: ffff88012a580000 R14: 0000000000000000 R15: ffff88012a580000 [ 97.997198] FS: 0000000000000000(0000) GS:ffff88012fc40000(0000) knlGS:0000000000000000 [ 98.005280] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 98.011021] CR2: 00000000000000cc CR3: 0000000126c5d000 CR4: 00000000000006e0 [ 98.018149] Stack: [ 98.020157] 00000000ffffffff ffff88012fc43ac8 ffffffffa017ad0a 000000000000000e [ 98.027584] 0000001300000000 0000000077d59998 ffff8801292c8700 00000020002949b3 [ 98.035010] ffff88012a580000 0000000000000000 ffff88012a580000 ffff88012fc43a98 [ 98.042437] Call Trace: [ 98.044879] <IRQ> [ 98.046803] [<ffffffffa017ad0a>] ? tg3_start_xmit+0x84a/0xd60 [tg3] [ 98.053156] [<ffffffff815eeee0>] skb_mac_gso_segment+0xb0/0x130 [ 98.059158] [<ffffffff815eefd3>] __skb_gso_segment+0x73/0x110 [ 98.064985] [<ffffffff815ef40d>] validate_xmit_skb+0x12d/0x2b0 [ 98.070899] [<ffffffff815ef5d2>] validate_xmit_skb_list+0x42/0x70 [ 98.077073] [<ffffffff81618560>] sch_direct_xmit+0xd0/0x1b0 [ 98.082726] [<ffffffff815efd86>] __dev_queue_xmit+0x486/0x690 [ 98.088554] [<ffffffff8135c135>] ? cpumask_next_and+0x35/0x50 [ 98.094380] [<ffffffff815effa0>] dev_queue_xmit+0x10/0x20 [ 98.099863] [<ffffffffa09ce057>] br_dev_queue_push_xmit+0xa7/0x170 [bridge] [ 98.106907] [<ffffffffa09ce161>] br_forward_finish+0x41/0xc0 [bridge] [ 98.113430] [<ffffffff81627cf2>] ? nf_iterate+0x52/0x60 [ 98.118735] [<ffffffff81627d6b>] ? nf_hook_slow+0x6b/0xc0 [ 98.124216] [<ffffffffa09ce32c>] __br_forward+0x14c/0x1e0 [bridge] [ 98.130480] [<ffffffffa09ce120>] ? br_dev_queue_push_xmit+0x170/0x170 [bridge] [ 98.137785] [<ffffffffa09ce4bd>] br_forward+0x9d/0xb0 [bridge] [ 98.143701] [<ffffffffa09cfbb7>] br_handle_frame_finish+0x267/0x560 [bridge] [ 98.150834] [<ffffffffa09d0064>] br_handle_frame+0x174/0x2f0 [bridge] [ 98.157355] [<ffffffff8102fb89>] ? sched_clock+0x9/0x10 [ 98.162662] [<ffffffff810b63b2>] ? sched_clock_cpu+0x72/0xa0 [ 98.168403] [<ffffffff815eccf5>] __netif_receive_skb_core+0x1e5/0xa20 [ 98.174926] [<ffffffff813659f9>] ? timerqueue_add+0x59/0xb0 [ 98.180580] [<ffffffff815ed548>] __netif_receive_skb+0x18/0x60 [ 98.186494] [<ffffffff815ee625>] process_backlog+0x95/0x140 [ 98.192145] [<ffffffff815edccd>] net_rx_action+0x16d/0x380 [ 98.197713] [<ffffffff8170cff1>] __do_softirq+0xd1/0x283 [ 98.203106] [<ffffffff8170b2bc>] do_softirq_own_stack+0x1c/0x30 [ 98.209107] <EOI> [ 98.211029] [<ffffffff8108a5c0>] do_softirq+0x50/0x60 [ 98.216166] [<ffffffff815ec853>] netif_rx_ni+0x33/0x80 [ 98.221386] [<ffffffffa09eeff7>] tun_get_user+0x487/0x7f0 [tun] [ 98.227388] [<ffffffffa09ef3ab>] tun_sendmsg+0x4b/0x60 [tun] [ 98.233129] [<ffffffffa0b68932>] handle_tx+0x282/0x540 [vhost_net] [ 98.239392] [<ffffffffa0b68c25>] handle_tx_kick+0x15/0x20 [vhost_net] [ 98.245916] [<ffffffffa0abacfe>] vhost_worker+0x9e/0xf0 [vhost] [ 98.251919] [<ffffffffa0abac60>] ? vhost_umem_alloc+0x40/0x40 [vhost] [ 98.258440] [<ffffffff81003a47>] ? do_syscall_64+0x67/0x180 [ 98.264094] [<ffffffff810a44d9>] kthread+0xd9/0xf0 [ 98.268965] [<ffffffff810a4400>] ? kthread_park+0x60/0x60 [ 98.274444] [<ffffffff8170a4d5>] ret_from_fork+0x25/0x30 [ 98.279836] Code: 8b 93 d8 00 00 00 48 2b 93 d0 00 00 00 4c 89 e6 48 89 df 66 89 93 c2 00 00 00 ff 10 48 3d 00 f0 ff ff 49 89 c2 0f 87 52 01 00 00 <41> 8b 92 cc 00 00 00 48 8b 80 d0 00 00 00 44 0f b7 74 10 06 66 [ 98.299425] RIP [<ffffffff816e52f9>] ipv6_gso_segment+0x119/0x2f0 [ 98.305612] RSP <ffff88012fc43a10> [ 98.309094] CR2: 00000000000000cc [ 98.312406] ---[ end trace 726a2c7a2d2d78d0 ]--- Signed-off-by: Artem Savkov <asavkov@redhat.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-12-10packet: fix race condition in packet_set_ringPhilip Pettersson
[ Upstream commit 84ac7260236a49c79eede91617700174c2c19b0c ] When packet_set_ring creates a ring buffer it will initialize a struct timer_list if the packet version is TPACKET_V3. This value can then be raced by a different thread calling setsockopt to set the version to TPACKET_V1 before packet_set_ring has finished. This leads to a use-after-free on a function pointer in the struct timer_list when the socket is closed as the previously initialized timer will not be deleted. The bug is fixed by taking lock_sock(sk) in packet_setsockopt when changing the packet version while also taking the lock at the start of packet_set_ring. Fixes: f6fb8f100b80 ("af-packet: TPACKET_V3 flexible buffer implementation.") Signed-off-by: Philip Pettersson <philip.pettersson@gmail.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-12-10GSO: Reload iph after pskb_may_pullArnaldo Carvalho de Melo
[ Upstream commit a510887824171ad260cc4a2603396c6247fdd091 ] As it may get stale and lead to use after free. Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: Alexander Duyck <aduyck@mirantis.com> Cc: Andrey Konovalov <andreyknvl@google.com> Fixes: cbc53e08a793 ("GSO: Add GSO type for fixed IPv4 ID") Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Acked-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-12-10net/dccp: fix use-after-free in dccp_invalid_packetEric Dumazet
[ Upstream commit 648f0c28df282636c0c8a7a19ca3ce5fc80a39c3 ] pskb_may_pull() can reallocate skb->head, we need to reload dh pointer in dccp_invalid_packet() or risk use after free. Bug found by Andrey Konovalov using syzkaller. Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Andrey Konovalov <andreyknvl@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-12-10net: macb: fix the RX queue reset in macb_rx()Cyrille Pitchen
[ Upstream commit a0b44eea372b449ef9744fb1d90491cc063289b8 ] On macb only (not gem), when a RX queue corruption was detected from macb_rx(), the RX queue was reset: during this process the RX ring buffer descriptor was initialized by macb_init_rx_ring() but we forgot to also set bp->rx_tail to 0. Indeed, when processing the received frames, bp->rx_tail provides the macb driver with the index in the RX ring buffer of the next buffer to process. So when the whole ring buffer is reset we must also reset bp->rx_tail so the driver is synchronized again with the hardware. Since macb_init_rx_ring() is called from many locations, currently from macb_rx() and macb_init_rings(), we'd rather add the "bp->rx_tail = 0;" line inside macb_init_rx_ring() than add the very same line after each call of this function. Without this fix, the rx queue is not reset properly to recover from queue corruption and connection drop may occur. Signed-off-by: Cyrille Pitchen <cyrille.pitchen@atmel.com> Fixes: 9ba723b081a2 ("net: macb: remove BUG_ON() and reset the queue to handle RX errors") Acked-by: Nicolas Ferre <nicolas.ferre@atmel.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-12-10netlink: Do not schedule work from sk_destructHerbert Xu
[ Upstream commit ed5d7788a934a4b6d6d025e948ed4da496b4f12e ] It is wrong to schedule a work from sk_destruct using the socket as the memory reserve because the socket will be freed immediately after the return from sk_destruct. Instead we should do the deferral prior to sk_free. This patch does just that. Fixes: 707693c8a498 ("netlink: Call cb->done from a worker thread") Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Tested-by: Andrey Konovalov <andreyknvl@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-12-10netlink: Call cb->done from a worker threadHerbert Xu
[ Upstream commit 707693c8a498697aa8db240b93eb76ec62e30892 ] The cb->done interface expects to be called in process context. This was broken by the netlink RCU conversion. This patch fixes it by adding a worker struct to make the cb->done call where necessary. Fixes: 21e4902aea80 ("netlink: Lockless lookup with RCU grace...") Reported-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Acked-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-12-10net/sched: pedit: make sure that offset is validAmir Vadai
[ Upstream commit 95c2027bfeda21a28eb245121e6a249f38d0788e ] Add a validation function to make sure offset is valid: 1. Not below skb head (could happen when offset is negative). 2. Validate both 'offset' and 'at'. Signed-off-by: Amir Vadai <amir@vadai.me> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-12-10net: dsa: fix unbalanced dsa_switch_tree reference countingNikita Yushchenko
[ Upstream commit 7a99cd6e213685b78118382e6a8fed506c82ccb2 ] _dsa_register_switch() gets a dsa_switch_tree object either via dsa_get_dst() or via dsa_add_dst(). Former path does not increase kref in returned object (resulting into caller not owning a reference), while later path does create a new object (resulting into caller owning a reference). The rest of _dsa_register_switch() assumes that it owns a reference, and calls dsa_put_dst(). This causes a memory breakage if first switch in the tree initialized successfully, but second failed to initialize. In particular, freed dsa_swith_tree object is left referenced by switch that was initialized, and later access to sysfs attributes of that switch cause OOPS. To fix, need to add kref_get() call to dsa_get_dst(). Fixes: 83c0afaec7b7 ("net: dsa: Add new binding implementation") Signed-off-by: Nikita Yushchenko <nikita.yoush@cogentembedded.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-12-10net, sched: respect rcu grace period on cls destructionDaniel Borkmann
[ Upstream commit d936377414fadbafb4d17148d222fe45ca5442d4 ] Roi reported a crash in flower where tp->root was NULL in ->classify() callbacks. Reason is that in ->destroy() tp->root is set to NULL via RCU_INIT_POINTER(). It's problematic for some of the classifiers, because this doesn't respect RCU grace period for them, and as a result, still outstanding readers from tc_classify() will try to blindly dereference a NULL tp->root. The tp->root object is strictly private to the classifier implementation and holds internal data the core such as tc_ctl_tfilter() doesn't know about. Within some classifiers, such as cls_bpf, cls_basic, etc, tp->root is only checked for NULL in ->get() callback, but nowhere else. This is misleading and seemed to be copied from old classifier code that was not cleaned up properly. For example, d3fa76ee6b4a ("[NET_SCHED]: cls_basic: fix NULL pointer dereference") moved tp->root initialization into ->init() routine, where before it was part of ->change(), so ->get() had to deal with tp->root being NULL back then, so that was indeed a valid case, after d3fa76ee6b4a, not really anymore. We used to set tp->root to NULL long ago in ->destroy(), see 47a1a1d4be29 ("pkt_sched: remove unnecessary xchg() in packet classifiers"); but the NULLifying was reintroduced with the RCUification, but it's not correct for every classifier implementation. In the cases that are fixed here with one exception of cls_cgroup, tp->root object is allocated and initialized inside ->init() callback, which is always performed at a point in time after we allocate a new tp, which means tp and thus tp->root was not globally visible in the tp chain yet (see tc_ctl_tfilter()). Also, on destruction tp->root is strictly kfree_rcu()'ed in ->destroy() handler, same for the tp which is kfree_rcu()'ed right when we return from ->destroy() in tcf_destroy(). This means, the head object's lifetime for such classifiers is always tied to the tp lifetime. The RCU callback invocation for the two kfree_rcu() could be out of order, but that's fine since both are independent. Dropping the RCU_INIT_POINTER(tp->root, NULL) for these classifiers here means that 1) we don't need a useless NULL check in fast-path and, 2) that outstanding readers of that tp in tc_classify() can still execute under respect with RCU grace period as it is actually expected. Things that haven't been touched here: cls_fw and cls_route. They each handle tp->root being NULL in ->classify() path for historic reasons, so their ->destroy() implementation can stay as is. If someone actually cares, they could get cleaned up at some point to avoid the test in fast path. cls_u32 doesn't set tp->root to NULL. For cls_rsvp, I just added a !head should anyone actually be using/testing it, so it at least aligns with cls_fw and cls_route. For cls_flower we additionally need to defer rhashtable destruction (to a sleepable context) after RCU grace period as concurrent readers might still access it. (Note that in this case we need to hold module reference to keep work callback address intact, since we only wait on module unload for all call_rcu()s to finish.) This fixes one race to bring RCU grace period guarantees back. Next step as worked on by Cong however is to fix 1e052be69d04 ("net_sched: destroy proto tp when all filters are gone") to get the order of unlinking the tp in tc_ctl_tfilter() for the RTM_DELTFILTER case right by moving RCU_INIT_POINTER() before tcf_destroy() and let the notification for removal be done through the prior ->delete() callback. Both are independant issues. Once we have that right, we can then clean tp->root up for a number of classifiers by not making them RCU pointers, which requires a new callback (->uninit) that is triggered from tp's RCU callback, where we just kfree() tp->root from there. Fixes: 1f947bf151e9 ("net: sched: rcu'ify cls_bpf") Fixes: 9888faefe132 ("net: sched: cls_basic use RCU") Fixes: 70da9f0bf999 ("net: sched: cls_flow use RCU") Fixes: 77b9900ef53a ("tc: introduce Flower classifier") Fixes: bf3994d2ed31 ("net/sched: introduce Match-all classifier") Fixes: 952313bd6258 ("net: sched: cls_cgroup use RCU") Reported-by: Roi Dayan <roid@mellanox.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: John Fastabend <john.fastabend@gmail.com> Cc: Roi Dayan <roid@mellanox.com> Cc: Jiri Pirko <jiri@mellanox.com> Acked-by: John Fastabend <john.r.fastabend@intel.com> Acked-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-12-10net: dsa: bcm_sf2: Ensure we re-negotiate EEE during after link changeFlorian Fainelli
[ Upstream commit 76da8706d90d8641eeb9b8e579942ed80b6c0880 ] In case the link change and EEE is enabled or disabled, always try to re-negotiate this with the link partner. Fixes: 450b05c15f9c ("net: dsa: bcm_sf2: add support for controlling EEE") Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-12-10udplite: call proper backlog handlersEric Dumazet
[ Upstream commit 30c7be26fd3587abcb69587f781098e3ca2d565b ] In commits 93821778def10 ("udp: Fix rcv socket locking") and f7ad74fef3af ("net/ipv6/udp: UDP encapsulation: break backlog_rcv into __udpv6_queue_rcv_skb") UDP backlog handlers were renamed, but UDPlite was forgotten. This leads to crashes if UDPlite header is pulled twice, which happens starting from commit e6afc8ace6dd ("udp: remove headers from UDP packets before queueing") Bug found by syzkaller team, thanks a lot guys ! Note that backlog use in UDP/UDPlite is scheduled to be removed starting from linux-4.10, so this patch is only needed up to linux-4.9 Fixes: 93821778def1 ("udp: Fix rcv socket locking") Fixes: f7ad74fef3af ("net/ipv6/udp: UDP encapsulation: break backlog_rcv into __udpv6_queue_rcv_skb") Fixes: e6afc8ace6dd ("udp: remove headers from UDP packets before queueing") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Andrey Konovalov <andreyknvl@google.com> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-12-10ipv6: bump genid when the IFA_F_TENTATIVE flag is clearPaolo Abeni
[ Upstream commit 764d3be6e415b40056834bfd29b994dc3f837606 ] When an ipv6 address has the tentative flag set, it can't be used as source for egress traffic, while the associated route, if any, can be looked up and even stored into some dst_cache. In the latter scenario, the source ipv6 address selected and stored in the cache is most probably wrong (e.g. with link-local scope) and the entity using the dst_cache will experience lack of ipv6 connectivity until said cache is cleared or invalidated. Overall this may cause lack of connectivity over most IPv6 tunnels (comprising geneve and vxlan), if the first egress packet reaches the tunnel before the DaD is completed for the used ipv6 address. This patch bumps a new genid after that the IFA_F_TENTATIVE flag is cleared, so that dst_cache will be invalidated on next lookup and ipv6 connectivity restored. Fixes: 0c1d70af924b ("net: use dst_cache for vxlan device") Fixes: 468dfffcd762 ("geneve: add dst caching support") Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-12-10rtnl: fix the loop index update error in rtnl_dump_ifinfo()Zhang Shengju
[ Upstream commit 3f0ae05d6fea0ed5b19efdbc9c9f8e02685a3af3 ] If the link is filtered out, loop index should also be updated. If not, loop index will not be correct. Fixes: dc599f76c22b0 ("net: Add support for filtering link dump by master device and kind") Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com> Acked-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-12-10l2tp: fix racy SOCK_ZAPPED flag check in l2tp_ip{,6}_bind()Guillaume Nault
[ Upstream commit 32c231164b762dddefa13af5a0101032c70b50ef ] Lock socket before checking the SOCK_ZAPPED flag in l2tp_ip6_bind(). Without lock, a concurrent call could modify the socket flags between the sock_flag(sk, SOCK_ZAPPED) test and the lock_sock() call. This way, a socket could be inserted twice in l2tp_ip6_bind_table. Releasing it would then leave a stale pointer there, generating use-after-free errors when walking through the list or modifying adjacent entries. BUG: KASAN: use-after-free in l2tp_ip6_close+0x22e/0x290 at addr ffff8800081b0ed8 Write of size 8 by task syz-executor/10987 CPU: 0 PID: 10987 Comm: syz-executor Not tainted 4.8.0+ #39 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.2-0-g33fbe13 by qemu-project.org 04/01/2014 ffff880031d97838 ffffffff829f835b ffff88001b5a1640 ffff8800081b0ec0 ffff8800081b15a0 ffff8800081b6d20 ffff880031d97860 ffffffff8174d3cc ffff880031d978f0 ffff8800081b0e80 ffff88001b5a1640 ffff880031d978e0 Call Trace: [<ffffffff829f835b>] dump_stack+0xb3/0x118 lib/dump_stack.c:15 [<ffffffff8174d3cc>] kasan_object_err+0x1c/0x70 mm/kasan/report.c:156 [< inline >] print_address_description mm/kasan/report.c:194 [<ffffffff8174d666>] kasan_report_error+0x1f6/0x4d0 mm/kasan/report.c:283 [< inline >] kasan_report mm/kasan/report.c:303 [<ffffffff8174db7e>] __asan_report_store8_noabort+0x3e/0x40 mm/kasan/report.c:329 [< inline >] __write_once_size ./include/linux/compiler.h:249 [< inline >] __hlist_del ./include/linux/list.h:622 [< inline >] hlist_del_init ./include/linux/list.h:637 [<ffffffff8579047e>] l2tp_ip6_close+0x22e/0x290 net/l2tp/l2tp_ip6.c:239 [<ffffffff850b2dfd>] inet_release+0xed/0x1c0 net/ipv4/af_inet.c:415 [<ffffffff851dc5a0>] inet6_release+0x50/0x70 net/ipv6/af_inet6.c:422 [<ffffffff84c4581d>] sock_release+0x8d/0x1d0 net/socket.c:570 [<ffffffff84c45976>] sock_close+0x16/0x20 net/socket.c:1017 [<ffffffff817a108c>] __fput+0x28c/0x780 fs/file_table.c:208 [<ffffffff817a1605>] ____fput+0x15/0x20 fs/file_table.c:244 [<ffffffff813774f9>] task_work_run+0xf9/0x170 [<ffffffff81324aae>] do_exit+0x85e/0x2a00 [<ffffffff81326dc8>] do_group_exit+0x108/0x330 [<ffffffff81348cf7>] get_signal+0x617/0x17a0 kernel/signal.c:2307 [<ffffffff811b49af>] do_signal+0x7f/0x18f0 [<ffffffff810039bf>] exit_to_usermode_loop+0xbf/0x150 arch/x86/entry/common.c:156 [< inline >] prepare_exit_to_usermode arch/x86/entry/common.c:190 [<ffffffff81006060>] syscall_return_slowpath+0x1a0/0x1e0 arch/x86/entry/common.c:259 [<ffffffff85e4d726>] entry_SYSCALL_64_fastpath+0xc4/0xc6 Object at ffff8800081b0ec0, in cache L2TP/IPv6 size: 1448 Allocated: PID = 10987 [ 1116.897025] [<ffffffff811ddcb6>] save_stack_trace+0x16/0x20 [ 1116.897025] [<ffffffff8174c736>] save_stack+0x46/0xd0 [ 1116.897025] [<ffffffff8174c9ad>] kasan_kmalloc+0xad/0xe0 [ 1116.897025] [<ffffffff8174cee2>] kasan_slab_alloc+0x12/0x20 [ 1116.897025] [< inline >] slab_post_alloc_hook mm/slab.h:417 [ 1116.897025] [< inline >] slab_alloc_node mm/slub.c:2708 [ 1116.897025] [< inline >] slab_alloc mm/slub.c:2716 [ 1116.897025] [<ffffffff817476a8>] kmem_cache_alloc+0xc8/0x2b0 mm/slub.c:2721 [ 1116.897025] [<ffffffff84c4f6a9>] sk_prot_alloc+0x69/0x2b0 net/core/sock.c:1326 [ 1116.897025] [<ffffffff84c58ac8>] sk_alloc+0x38/0xae0 net/core/sock.c:1388 [ 1116.897025] [<ffffffff851ddf67>] inet6_create+0x2d7/0x1000 net/ipv6/af_inet6.c:182 [ 1116.897025] [<ffffffff84c4af7b>] __sock_create+0x37b/0x640 net/socket.c:1153 [ 1116.897025] [< inline >] sock_create net/socket.c:1193 [ 1116.897025] [< inline >] SYSC_socket net/socket.c:1223 [ 1116.897025] [<ffffffff84c4b46f>] SyS_socket+0xef/0x1b0 net/socket.c:1203 [ 1116.897025] [<ffffffff85e4d685>] entry_SYSCALL_64_fastpath+0x23/0xc6 Freed: PID = 10987 [ 1116.897025] [<ffffffff811ddcb6>] save_stack_trace+0x16/0x20 [ 1116.897025] [<ffffffff8174c736>] save_stack+0x46/0xd0 [ 1116.897025] [<ffffffff8174cf61>] kasan_slab_free+0x71/0xb0 [ 1116.897025] [< inline >] slab_free_hook mm/slub.c:1352 [ 1116.897025] [< inline >] slab_free_freelist_hook mm/slub.c:1374 [ 1116.897025] [< inline >] slab_free mm/slub.c:2951 [ 1116.897025] [<ffffffff81748b28>] kmem_cache_free+0xc8/0x330 mm/slub.c:2973 [ 1116.897025] [< inline >] sk_prot_free net/core/sock.c:1369 [ 1116.897025] [<ffffffff84c541eb>] __sk_destruct+0x32b/0x4f0 net/core/sock.c:1444 [ 1116.897025] [<ffffffff84c5aca4>] sk_destruct+0x44/0x80 net/core/sock.c:1452 [ 1116.897025] [<ffffffff84c5ad33>] __sk_free+0x53/0x220 net/core/sock.c:1460 [ 1116.897025] [<ffffffff84c5af23>] sk_free+0x23/0x30 net/core/sock.c:1471 [ 1116.897025] [<ffffffff84c5cb6c>] sk_common_release+0x28c/0x3e0 ./include/net/sock.h:1589 [ 1116.897025] [<ffffffff8579044e>] l2tp_ip6_close+0x1fe/0x290 net/l2tp/l2tp_ip6.c:243 [ 1116.897025] [<ffffffff850b2dfd>] inet_release+0xed/0x1c0 net/ipv4/af_inet.c:415 [ 1116.897025] [<ffffffff851dc5a0>] inet6_release+0x50/0x70 net/ipv6/af_inet6.c:422 [ 1116.897025] [<ffffffff84c4581d>] sock_release+0x8d/0x1d0 net/socket.c:570 [ 1116.897025] [<ffffffff84c45976>] sock_close+0x16/0x20 net/socket.c:1017 [ 1116.897025] [<ffffffff817a108c>] __fput+0x28c/0x780 fs/file_table.c:208 [ 1116.897025] [<ffffffff817a1605>] ____fput+0x15/0x20 fs/file_table.c:244 [ 1116.897025] [<ffffffff813774f9>] task_work_run+0xf9/0x170 [ 1116.897025] [<ffffffff81324aae>] do_exit+0x85e/0x2a00 [ 1116.897025] [<ffffffff81326dc8>] do_group_exit+0x108/0x330 [ 1116.897025] [<ffffffff81348cf7>] get_signal+0x617/0x17a0 kernel/signal.c:2307 [ 1116.897025] [<ffffffff811b49af>] do_signal+0x7f/0x18f0 [ 1116.897025] [<ffffffff810039bf>] exit_to_usermode_loop+0xbf/0x150 arch/x86/entry/common.c:156 [ 1116.897025] [< inline >] prepare_exit_to_usermode arch/x86/entry/common.c:190 [ 1116.897025] [<ffffffff81006060>] syscall_return_slowpath+0x1a0/0x1e0 arch/x86/entry/common.c:259 [ 1116.897025] [<ffffffff85e4d726>] entry_SYSCALL_64_fastpath+0xc4/0xc6 Memory state around the buggy address: ffff8800081b0d80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ffff8800081b0e00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc >ffff8800081b0e80: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb ^ ffff8800081b0f00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff8800081b0f80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ================================================================== The same issue exists with l2tp_ip_bind() and l2tp_ip_bind_table. Fixes: c51ce49735c1 ("l2tp: fix oops in L2TP IP sockets for connect() AF_UNSPEC case") Reported-by: Baozeng Ding <sploving1@gmail.com> Reported-by: Andrey Konovalov <andreyknvl@google.com> Tested-by: Baozeng Ding <sploving1@gmail.com> Signed-off-by: Guillaume Nault <g.nault@alphalink.fr> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-12-10rtnetlink: fix FDB size computationSabrina Dubroca
[ Upstream commit f82ef3e10a870acc19fa04f80ef5877eaa26f41e ] Add missing NDA_VLAN attribute's size. Fixes: 1e53d5bb8878 ("net: Pass VLAN ID to rtnl_fdb_notify.") Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-12-10af_unix: conditionally use freezable blocking calls in readWANG Cong
[ Upstream commit 06a77b07e3b44aea2b3c0e64de420ea2cfdcbaa9 ] Commit 2b15af6f95 ("af_unix: use freezable blocking calls in read") converts schedule_timeout() to its freezable version, it was probably correct at that time, but later, commit 2b514574f7e8 ("net: af_unix: implement splice for stream af_unix sockets") breaks the strong requirement for a freezable sleep, according to commit 0f9548ca1091: We shouldn't try_to_freeze if locks are held. Holding a lock can cause a deadlock if the lock is later acquired in the suspend or hibernate path (e.g. by dpm). Holding a lock can also cause a deadlock in the case of cgroup_freezer if a lock is held inside a frozen cgroup that is later acquired by a process outside that group. The pipe_lock is still held at that point. So use freezable version only for the recvmsg call path, avoid impact for Android. Fixes: 2b514574f7e8 ("net: af_unix: implement splice for stream af_unix sockets") Reported-by: Dmitry Vyukov <dvyukov@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Colin Cross <ccross@android.com> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-12-10net: sky2: Fix shutdown crashJeremy Linton
[ Upstream commit 06ba3b2133dc203e1e9bc36cee7f0839b79a9e8b ] The sky2 frequently crashes during machine shutdown with: sky2_get_stats+0x60/0x3d8 [sky2] dev_get_stats+0x68/0xd8 rtnl_fill_stats+0x54/0x140 rtnl_fill_ifinfo+0x46c/0xc68 rtmsg_ifinfo_build_skb+0x7c/0xf0 rtmsg_ifinfo.part.22+0x3c/0x70 rtmsg_ifinfo+0x50/0x5c netdev_state_change+0x4c/0x58 linkwatch_do_dev+0x50/0x88 __linkwatch_run_queue+0x104/0x1a4 linkwatch_event+0x30/0x3c process_one_work+0x140/0x3e0 worker_thread+0x60/0x44c kthread+0xdc/0xf0 ret_from_fork+0x10/0x50 This is caused by the sky2 being called after it has been shutdown. A previous thread about this can be found here: https://lkml.org/lkml/2016/4/12/410 An alternative fix is to assure that IFF_UP gets cleared by calling dev_close() during shutdown. This is similar to what the bnx2/tg3/xgene and maybe others are doing to assure that the driver isn't being called following _shutdown(). Signed-off-by: Jeremy Linton <jeremy.linton@arm.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-12-10ip6_tunnel: disable caching when the traffic class is inheritedPaolo Abeni
[ Upstream commit b5c2d49544e5930c96e2632a7eece3f4325a1888 ] If an ip6 tunnel is configured to inherit the traffic class from the inner header, the dst_cache must be disabled or it will foul the policy routing. The issue is apprently there since at leat Linux-2.6.12-rc2. Reported-by: Liam McBirnie <liam.mcbirnie@boeing.com> Cc: Liam McBirnie <liam.mcbirnie@boeing.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-12-10net: check dead netns for peernet2id_alloc()WANG Cong
[ Upstream commit cfc44a4d147ea605d66ccb917cc24467d15ff867 ] Andrei reports we still allocate netns ID from idr after we destroy it in cleanup_net(). cleanup_net(): ... idr_destroy(&net->netns_ids); ... list_for_each_entry_reverse(ops, &pernet_list, list) ops_exit_list(ops, &net_exit_list); -> rollback_registered_many() -> rtmsg_ifinfo_build_skb() -> rtnl_fill_ifinfo() -> peernet2id_alloc() After that point we should not even access net->netns_ids, we should check the death of the current netns as early as we can in peernet2id_alloc(). For net-next we can consider to avoid sending rtmsg totally, it is a good optimization for netns teardown path. Fixes: 0c7aecd4bde4 ("netns: add rtnl cmd to add and get peer netns ids") Reported-by: Andrei Vagin <avagin@gmail.com> Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Andrei Vagin <avagin@openvz.org> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-12-10net: dsa: b53: Fix VLAN usage and how we treat CPU portFlorian Fainelli
[ Upstream commit e47112d9d6009bf6b7438cedc0270316d6b0370d ] We currently have a fundamental problem in how we treat the CPU port and its VLAN membership. As soon as a second VLAN is configured to be untagged, the CPU automatically becomes untagged for that VLAN as well, and yet, we don't gracefully make sure that the CPU becomes tagged in the other VLANs it could be a member of. This results in only one VLAN being effectively usable from the CPU's perspective. Instead of having some pretty complex logic which tries to maintain the CPU port's default VLAN and its untagged properties, just do something very simple which consists in neither altering the CPU port's PVID settings, nor its untagged settings: - whenever a VLAN is added, the CPU is automatically a member of this VLAN group, as a tagged member - PVID settings for downstream ports do not alter the CPU port's PVID since it now is part of all VLANs in the system This means that a typical example where e.g: LAN ports are in VLAN1, and WAN port is in VLAN2, now require having two VLAN interfaces for the host to properly terminate and send traffic from/to. Fixes: Fixes: a2482d2ce349 ("net: dsa: b53: Plug in VLAN support") Reported-by: Hartmut Knaack <knaack.h@gmx.de> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-12-10virtio-net: add a missing synchronize_net()Eric Dumazet
[ Upstream commit 963abe5c8a0273a1cf5913556da1b1189de0e57a ] It seems many drivers do not respect napi_hash_del() contract. When napi_hash_del() is used before netif_napi_del(), an RCU grace period is needed before freeing NAPI object. Fixes: 91815639d880 ("virtio-net: rx busy polling support") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Michael S. Tsirkin <mst@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-12-10gro_cells: mark napi struct as not busy poll candidatesEric Dumazet
[ Upstream commit e88a2766143a27bfe6704b4493b214de4094cf29 ] Rolf Neugebauer reported very long delays at netns dismantle. Eric W. Biederman was kind enough to look at this problem and noticed synchronize_net() occurring from netif_napi_del() that was added in linux-4.5 Busy polling makes no sense for tunnels NAPI. If busy poll is used for sessions over tunnels, the poller will need to poll the physical device queue anyway. netif_tx_napi_add() could be used here, but function name is misleading, and renaming it is not stable material, so set NAPI_STATE_NO_BUSY_POLL bit directly. This will avoid inserting gro_cells napi structures in napi_hash[] and avoid the problematic synchronize_net() (per possible cpu) that Rolf reported. Fixes: 93d05d4a320c ("net: provide generic busy polling to all NAPI drivers") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Rolf Neugebauer <rolf.neugebauer@docker.com> Reported-by: Eric W. Biederman <ebiederm@xmission.com> Acked-by: Cong Wang <xiyou.wangcong@gmail.com> Tested-by: Rolf Neugebauer <rolf.neugebauer@docker.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-12-08Linux 4.8.13v4.8.13Greg Kroah-Hartman
2016-12-08arm64: suspend: Reconfigure PSTATE after resume from idleJames Morse
commit d08544127d9fb4505635e3cb6871fd50a42947bd upstream. The suspend/resume path in kernel/sleep.S, as used by cpu-idle, does not save/restore PSTATE. As a result of this cpufeatures that were detected and have bits in PSTATE get lost when we resume from idle. UAO gets set appropriately on the next context switch. PAN will be re-enabled next time we return from user-space, but on a preemptible kernel we may run work accessing user space before this point. Add code to re-enable theses two features in __cpu_suspend_exit(). We re-use uao_thread_switch() passing current. Signed-off-by: James Morse <james.morse@arm.com> Cc: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-12-08arm64: mm: Set PSTATE.PAN from the cpu_enable_pan() callJames Morse
commit 7209c868600bd8926e37c10b9aae83124ccc1dd8 upstream. Commit 338d4f49d6f7 ("arm64: kernel: Add support for Privileged Access Never") enabled PAN by enabling the 'SPAN' feature-bit in SCTLR_EL1. This means the PSTATE.PAN bit won't be set until the next return to the kernel from userspace. On a preemptible kernel we may schedule work that accesses userspace on a CPU before it has done this. Now that cpufeature enable() calls are scheduled via stop_machine(), we can set PSTATE.PAN from the cpu_enable_pan() call. Add WARN_ON_ONCE(in_interrupt()) to check the PSTATE value we updated is not immediately discarded. Reported-by: Tony Thompson <anthony.thompson@arm.com> Reported-by: Vladimir Murzin <vladimir.murzin@arm.com> Signed-off-by: James Morse <james.morse@arm.com> [will: fixed typo in comment] Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-12-08arm64: cpufeature: Schedule enable() calls instead of calling them via IPIJames Morse
commit 2a6dcb2b5f3e21592ca8dfa198dcce7bec09b020 upstream. The enable() call for a cpufeature/errata is called using on_each_cpu(). This issues a cross-call IPI to get the work done. Implicitly, this stashes the running PSTATE in SPSR when the CPU receives the IPI, and restores it when we return. This means an enable() call can never modify PSTATE. To allow PAN to do this, change the on_each_cpu() call to use stop_machine(). This schedules the work on each CPU which allows us to modify PSTATE. This involves changing the protype of all the enable() functions. enable_cpu_capabilities() is called during boot and enables the feature on all online CPUs. This path now uses stop_machine(). CPU features for hotplug'd CPUs are enabled by verify_local_cpu_features() which only acts on the local CPU, and can already modify the running PSTATE as it is called from secondary_start_kernel(). Reported-by: Tony Thompson <anthony.thompson@arm.com> Reported-by: Vladimir Murzin <vladimir.murzin@arm.com> Signed-off-by: James Morse <james.morse@arm.com> Cc: Suzuki K Poulose <suzuki.poulose@arm.com> Signed-off-by: Will Deacon <will.deacon@arm.com> [Removed enable() hunks for A53 workaround] Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-12-08batman-adv: Detect missing primaryif during tp_send as errorSven Eckelmann
commit e13258f38e927b61cdb5f4ad25309450d3b127d1 upstream. The throughput meter detects different situations as problems for the current test. It stops the test after these and reports it to userspace. This also has to be done when the primary interface disappeared during the test. Fixes: 33a3bb4a3345 ("batman-adv: throughput meter implementation") Reported-by: Joe Perches <joe@perches.com> Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-12-08clk: sunxi: Fix M factor computation for APB1Stéphan Rafin
commit ac95330b96376550ae7a533d1396272d675adfa2 upstream. commit cfa636886033 ("clk: sunxi: factors: Consolidate get_factors parameters into a struct") introduced a regression for m factor computation in sun4i_get_apb1_factors function. The old code reassigned the "parent_rate" parameter to the targeted divisor value and was buggy for the returned frequency but not for the computed factors. Now, returned frequency is good but m factor is incorrectly computed (its max value 31 is always set resulting in a significantly slower frequency than the requested one...) This patch simply restores the original proper computation for m while keeping the good changes for returned rate. Fixes: cfa636886033 ("clk: sunxi: factors: Consolidate get_factors parameters into a struct") Signed-off-by: Stéphan Rafin <stephan@soliotek.com> Signed-off-by: Maxime Ripard <maxime.ripard@free-electrons.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-12-08perf/x86: Restore TASK_SIZE check on frame pointerJohannes Weiner
commit ae31fe51a3cceaa0cabdb3058f69669ecb47f12e upstream. The following commit: 75925e1ad7f5 ("perf/x86: Optimize stack walk user accesses") ... switched from copy_from_user_nmi() to __copy_from_user_nmi() with a manual access_ok() check. Unfortunately, copy_from_user_nmi() does an explicit check against TASK_SIZE, whereas the access_ok() uses whatever the current address limit of the task is. We are getting NMIs when __probe_kernel_read() has switched to KERNEL_DS, and then see vmalloc faults when we access what looks like pointers into vmalloc space: [] WARNING: CPU: 3 PID: 3685731 at arch/x86/mm/fault.c:435 vmalloc_fault+0x289/0x290 [] CPU: 3 PID: 3685731 Comm: sh Tainted: G W 4.6.0-5_fbk1_223_gdbf0f40 #1 [] Call Trace: [] <NMI> [<ffffffff814717d1>] dump_stack+0x4d/0x6c [] [<ffffffff81076e43>] __warn+0xd3/0xf0 [] [<ffffffff81076f2d>] warn_slowpath_null+0x1d/0x20 [] [<ffffffff8104a899>] vmalloc_fault+0x289/0x290 [] [<ffffffff8104b5a0>] __do_page_fault+0x330/0x490 [] [<ffffffff8104b70c>] do_page_fault+0xc/0x10 [] [<ffffffff81794e82>] page_fault+0x22/0x30 [] [<ffffffff81006280>] ? perf_callchain_user+0x100/0x2a0 [] [<ffffffff8115124f>] get_perf_callchain+0x17f/0x190 [] [<ffffffff811512c7>] perf_callchain+0x67/0x80 [] [<ffffffff8114e750>] perf_prepare_sample+0x2a0/0x370 [] [<ffffffff8114e840>] perf_event_output+0x20/0x60 [] [<ffffffff8114aee7>] ? perf_event_update_userpage+0xc7/0x130 [] [<ffffffff8114ea01>] __perf_event_overflow+0x181/0x1d0 [] [<ffffffff8114f484>] perf_event_overflow+0x14/0x20 [] [<ffffffff8100a6e3>] intel_pmu_handle_irq+0x1d3/0x490 [] [<ffffffff8147daf7>] ? copy_user_enhanced_fast_string+0x7/0x10 [] [<ffffffff81197191>] ? vunmap_page_range+0x1a1/0x2f0 [] [<ffffffff811972f1>] ? unmap_kernel_range_noflush+0x11/0x20 [] [<ffffffff814f2056>] ? ghes_copy_tofrom_phys+0x116/0x1f0 [] [<ffffffff81040d1d>] ? x2apic_send_IPI_self+0x1d/0x20 [] [<ffffffff8100411d>] perf_event_nmi_handler+0x2d/0x50 [] [<ffffffff8101ea31>] nmi_handle+0x61/0x110 [] [<ffffffff8101ef94>] default_do_nmi+0x44/0x110 [] [<ffffffff8101f13b>] do_nmi+0xdb/0x150 [] [<ffffffff81795187>] end_repeat_nmi+0x1a/0x1e [] [<ffffffff8147daf7>] ? copy_user_enhanced_fast_string+0x7/0x10 [] [<ffffffff8147daf7>] ? copy_user_enhanced_fast_string+0x7/0x10 [] [<ffffffff8147daf7>] ? copy_user_enhanced_fast_string+0x7/0x10 [] <<EOE>> <IRQ> [<ffffffff8115d05e>] ? __probe_kernel_read+0x3e/0xa0 Fix this by moving the valid_user_frame() check to before the uaccess that loads the return address and the pointer to the next frame. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: linux-kernel@vger.kernel.org Fixes: 75925e1ad7f5 ("perf/x86: Optimize stack walk user accesses") Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-12-08drm/mediatek: fix null pointer dereferenceMatthias Brugger
commit 5ad45307d990020b25a8f7486178b6e033790f70 upstream. The probe function requests the interrupt before initializing the ddp component. Which leads to a null pointer dereference at boot. Fix this by requesting the interrput after all components got initialized properly. Fixes: 119f5173628a ("drm/mediatek: Add DRM Driver for Mediatek SoC MT8173.") Signed-off-by: Matthias Brugger <matthias.bgg@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Change-Id: I57193a7ab554dfb37c35a455900689333adf511c
2016-12-08pwm: Fix device reference leakJohan Hovold
commit 0e1614ac84f1719d87bed577963bb8140d0c9ce8 upstream. Make sure to drop the reference to the parent device taken by class_find_device() after "unexporting" any children when deregistering a PWM chip. Fixes: 0733424c9ba9 ("pwm: Unexport children before chip removal") Signed-off-by: Johan Hovold <johan@kernel.org> Signed-off-by: Thierry Reding <thierry.reding@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-12-08KVM: use after free in kvm_ioctl_create_device()Dan Carpenter
commit a0f1d21c1ccb1da66629627a74059dd7f5ac9c61 upstream. We should move the ops->destroy(dev) after the list_del(&dev->vm_node) so that we don't use "dev" after freeing it. Fixes: a28ebea2adc4 ("KVM: Protect device ops->create and list_add with kvm->lock") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-12-08arm64: dts: juno: fix cluster sleep state entry latency on all SoC versionsSudeep Holla
commit 909e481e2467f202b97d42beef246e8829416a85 upstream. The core and the cluster sleep state entry latencies can't be same as cluster sleep involves more work compared to core level e.g. shared cache maintenance. Experiments have shown on an average about 100us more latency for the cluster sleep state compared to the core level sleep. This patch fixes the entry latency for the cluster sleep state. Fixes: 28e10a8f3a03 ("arm64: dts: juno: Add idle-states to device tree") Cc: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> Cc: "Jon Medhurst (Tixy)" <tixy@linaro.org> Reviewed-by: Liviu Dudau <Liviu.Dudau@arm.com> Signed-off-by: Sudeep Holla <sudeep.holla@arm.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-12-08drm/radeon: fix check for port PM availabilityAlex Deucher
commit bcfdd5d5105087e6f33dfeb08a1ca6b2c0287b61 upstream. The ATPX method does not always exist on the dGPU, it may be located at the iGPU. The parent device of the iGPU is the root port for which bridge_d3 is false. This accidentally enables the legacy PM method which conflicts with port PM and prevented the dGPU from powering on. Ported from amdgpu commit: drm/amdgpu: fix check for port PM availability from Peter Wu. Fixes: d3ac31f3b4bf9fad (drm/radeon: fix power state when port pm is unavailable (v2)) Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: Peter Wu <peter@lekensteyn.nl> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-12-08drm/amdgpu: fix check for port PM availabilityPeter Wu
commit 7ac33e47d5769632010e537964c7e45498f8dc26 upstream. The ATPX method does not always exist on the dGPU, it may be located at the iGPU. The parent device of the iGPU is the root port for which bridge_d3 is false. This accidentally enables the legacy PM method which conflicts with port PM and prevented the dGPU from powering on. Fixes: 1db4496f167b ("drm/amdgpu: fix power state when port pm is unavailable") Reported-and-tested-by: Mike Lothian <mike@fireburn.co.uk> Signed-off-by: Peter Wu <peter@lekensteyn.nl> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-12-08drm/radeon: fix power state when port pm is unavailable (v2)Peter Wu
commit d3ac31f3b4bf9fade93d69770cb9c34912e017be upstream. When PCIe port PM is not enabled (system BIOS is pre-2015 or the pcie_port_pm=off parameter is set), legacy ATPX PM should still be marked as supported. Otherwise the GPU can fail to power on after runtime suspend. This affected a Dell Inspiron 5548. Ideally the BIOS date in the PCI core is lowered to 2013 (the first year where hybrid graphics platforms using power resources was introduced), but that seems more risky at this point and would not solve the pcie_port_pm=off issue. v2: agd: fix typo Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=98505 Signed-off-by: Peter Wu <peter@lekensteyn.nl> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-12-08drm/amdgpu: fix power state when port pm is unavailablePeter Wu
commit 1db4496f167bcc7c6541d449355ade2e7d339d52 upstream. When PCIe port PM is not enabled (system BIOS is pre-2015 or the pcie_port_pm=off parameter is set), legacy ATPX PM should still be marked as supported. Otherwise the GPU can fail to power on after runtime suspend. This affected a Dell Inspiron 5548. Ideally the BIOS date in the PCI core is lowered to 2013 (the first year where hybrid graphics platforms using power resources was introduced), but that seems more risky at this point and would not solve the pcie_port_pm=off issue. Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=98505 Reported-and-tested-by: Nayan Deshmukh <nayan26deshmukh@gmail.com> Signed-off-by: Peter Wu <peter@lekensteyn.nl> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-12-08drm/i915: drop the struct_mutex when wedged or trying to resetMatthew Auld
commit e411072d5740a49cdc9d0713798c30440757e451 upstream. We grab the struct_mutex in intel_crtc_page_flip, but if we are wedged or a reset is in progress we bail early but never seem to actually release the lock. Fixes: 7f1847ebf48b ("drm/i915: Simplify checking of GPU reset_counter in display pageflips") Cc: Chris Wilson <chris@chris-wilson.co.uk> Signed-off-by: Matthew Auld <matthew.auld@intel.com> Link: http://patchwork.freedesktop.org/patch/msgid/20161128103648.9235-1-matthew.auld@intel.com Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> (cherry picked from commit ddbb271aea87fc6004d3c8bcdb0710e980c7ec85) Signed-off-by: Jani Nikula <jani.nikula@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-12-08drm/i915: Don't touch NULL sg on i915_gem_object_get_pages_gtt() errorChris Wilson
commit 2420489bcb8910188578acc0c11c75445c2e4b92 upstream. On the DMA mapping error path, sg may be NULL (it has already been marked as the last scatterlist entry), and we should avoid dereferencing it again. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Fixes: e227330223a7 ("drm/i915: avoid leaking DMA mappings") Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: Imre Deak <imre.deak@intel.com> Link: http://patchwork.freedesktop.org/patch/msgid/20161114112930.2033-1-chris@chris-wilson.co.uk Reviewed-by: Matthew Auld <matthew.auld@intel.com> (cherry picked from commit b17993b7b29612369270567643bcff814f4b3d7f) Signed-off-by: Jani Nikula <jani.nikula@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-12-08KVM: arm/arm64: vgic: Don't notify EOI for non-SPIsMarc Zyngier
commit 8ca18eec2b2276b449c1dc86b98bf083c5fe4e09 upstream. When we inject a level triggerered interrupt (and unless it is backed by the physical distributor - timer style), we request a maintenance interrupt. Part of the processing for that interrupt is to feed to the rest of KVM (and to the eventfd subsystem) the information that the interrupt has been EOIed. But that notification only makes sense for SPIs, and not PPIs (such as the PMU interrupt). Skip over the notification if the interrupt is not an SPI. Fixes: 140b086dd197 ("KVM: arm/arm64: vgic-new: Add GICv2 world switch backend") Fixes: 59529f69f504 ("KVM: arm/arm64: vgic-new: Add GICv3 world switch backend") Reported-by: Catalin Marinas <catalin.marinas@arm.com> Tested-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-12-08mwifiex: printk() overflow with 32-byte SSIDsBrian Norris
commit fcd2042e8d36cf644bd2d69c26378d17158b17df upstream. SSIDs aren't guaranteed to be 0-terminated. Let's cap the max length when we print them out. This can be easily noticed by connecting to a network with a 32-octet SSID: [ 3903.502925] mwifiex_pcie 0000:01:00.0: info: trying to associate to '0123456789abcdef0123456789abcdef <uninitialized mem>' bssid xx:xx:xx:xx:xx:xx Fixes: 5e6e3a92b9a4 ("wireless: mwifiex: initial commit for Marvell mwifiex driver") Signed-off-by: Brian Norris <briannorris@chromium.org> Acked-by: Amitkumar Karwar <akarwar@marvell.com> Signed-off-by: Kalle Valo <kvalo@codeaurora.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-12-08PCI: Set Read Completion Boundary to 128 iff Root Port supports it (_HPX)Johannes Thumshirn
commit e42010d8207f9d15a605ceb8e321bcd9648071b0 upstream. Per PCIe spec r3.0, sec 2.3.1.1, the Read Completion Boundary (RCB) determines the naturally aligned address boundaries on which a Read Request may be serviced with multiple Completions: - For a Root Complex, RCB is 64 bytes or 128 bytes This value is reported in the Link Control Register Note: Bridges and Endpoints may implement a corresponding command bit which may be set by system software to indicate the RCB value for the Root Complex, allowing the Bridge/Endpoint to optimize its behavior when the Root Complex’s RCB is 128 bytes. - For all other system elements, RCB is 128 bytes Per sec 7.8.7, if a Root Port only supports a 64-byte RCB, the RCB of all downstream devices must be clear, indicating an RCB of 64 bytes. If the Root Port supports a 128-byte RCB, we may optionally set the RCB of downstream devices so they know they can generate larger Completions. Some BIOSes supply an _HPX that tells us to set RCB, even though the Root Port doesn't have RCB set, which may lead to Malformed TLP errors if the Endpoint generates completions larger than the Root Port can handle. The IBM x3850 X6 with BIOS version -[A8E120CUS-1.30]- 08/22/2016 supplies such an _HPX and a Mellanox MT27500 ConnectX-3 device fails to initialize: mlx4_core 0000:41:00.0: command 0xfff timed out (go bit not cleared) mlx4_core 0000:41:00.0: device is going to be reset mlx4_core 0000:41:00.0: Failed to obtain HW semaphore, aborting mlx4_core 0000:41:00.0: Fail to reset HCA ------------[ cut here ]------------ kernel BUG at drivers/net/ethernet/mellanox/mlx4/catas.c:193! After 6cd33649fa83 ("PCI: Add pci_configure_device() during enumeration") and 7a1562d4f2d0 ("PCI: Apply _HPX Link Control settings to all devices with a link"), we apply _HPX settings to *all* devices, not just those hot-added after boot. Before 7a1562d4f2d0, we didn't touch the Mellanox RCB, and the device worked. After 7a1562d4f2d0, we set its RCB to 128, and it failed. Set the RCB to 128 iff the Root Port supports a 128-byte RCB. Otherwise, set RCB to 64 bytes. This effectively ignores what _HPX tells us about RCB. Note that this change only affects _HPX handling. If we have no _HPX, this does nothing with RCB. [bhelgaas: changelog, clear RCB if not set for Root Port] Fixes: 6cd33649fa83 ("PCI: Add pci_configure_device() during enumeration") Fixes: 7a1562d4f2d0 ("PCI: Apply _HPX Link Control settings to all devices with a link") Link: https://bugzilla.kernel.org/show_bug.cgi?id=187781 Tested-by: Frank Danapfel <fdanapfe@redhat.com> Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Acked-by: Myron Stowe <myron.stowe@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>