summaryrefslogtreecommitdiffstats
path: root/net/sched/sch_api.c
AgeCommit message (Collapse)Author
2016-05-24net_sched: avoid too many hrtimer_start() callsEric Dumazet
I found a serious performance bug in packet schedulers using hrtimers. sch_htb and sch_fq are definitely impacted by this problem. We constantly rearm high resolution timers if some packets are throttled in one (or more) class, and other packets are flying through qdisc on another (non throttled) class. hrtimer_start() does not have the mod_timer() trick of doing nothing if expires value does not change : if (timer_pending(timer) && timer->expires == expires) return 1; This issue is particularly visible when multiple cpus can queue/dequeue packets on the same qdisc, as hrtimer code has to lock a remote base. I used following fix : 1) Change htb to use qdisc_watchdog_schedule_ns() instead of open-coding it. 2) Cache watchdog prior expiration. hrtimer might provide this, but I prefer to not rely on some hrtimer internal. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-26sched: align nlattr properly when neededNicolas Dichtel
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-29net_sched: update hierarchical backlog tooWANG Cong
When the bottom qdisc decides to, for example, drop some packet, it calls qdisc_tree_decrease_qlen() to update the queue length for all its ancestors, we need to update the backlog too to keep the stats on root qdisc accurate. Cc: Jamal Hadi Salim <jhs@mojatatu.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-23Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflicts: drivers/net/phy/bcm7xxx.c drivers/net/phy/marvell.c drivers/net/vxlan.c All three conflicts were cases of simple overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-18net_sched: Improve readability of filter processingJamal Hadi Salim
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-18net_sched fix: reclassification needs to consider ether protocol changesJamal Hadi Salim
actions could change the etherproto in particular with ethernet tunnelled data. Typically such actions, after peeling the outer header, will ask for the packet to be reclassified. We then need to restart the classification with the new proto header. Example setup used to catch this: sudo tc qdisc add dev $ETH ingress sudo $TC filter add dev $ETH parent ffff: pref 1 protocol 802.1Q \ u32 match u32 0 0 flowid 1:1 \ action vlan pop reclassify Fixes: 3b3ae880266d ("net: sched: consolidate tc_classify{,_compat}") Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-15net_sched: make qdisc_tree_decrease_qlen() work for non mqEric Dumazet
Stas Nichiporovich reported a regression in his HFSC qdisc setup on a non multi queue device. It turns out I mistakenly added a TCQ_F_NOPARENT flag on all qdisc allocated in qdisc_create() for non multi queue devices, which was rather buggy. I was clearly mislead by the TCQ_F_ONETXQUEUE that is also set here for no good reason, since it only matters for the root qdisc. Fixes: 4eaf3b84f288 ("net_sched: fix qdisc_tree_decrease_qlen() races") Reported-by: Stas Nichiporovich <stasn77@gmail.com> Tested-by: Stas Nichiporovich <stasn77@gmail.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-03net_sched: fix qdisc_tree_decrease_qlen() racesEric Dumazet
qdisc_tree_decrease_qlen() suffers from two problems on multiqueue devices. One problem is that it updates sch->q.qlen and sch->qstats.drops on the mq/mqprio root qdisc, while it should not : Daniele reported underflows errors : [ 681.774821] PAX: sch->q.qlen: 0 n: 1 [ 681.774825] PAX: size overflow detected in function qdisc_tree_decrease_qlen net/sched/sch_api.c:769 cicus.693_49 min, count: 72, decl: qlen; num: 0; context: sk_buff_head; [ 681.774954] CPU: 2 PID: 19 Comm: ksoftirqd/2 Tainted: G O 4.2.6.201511282239-1-grsec #1 [ 681.774955] Hardware name: ASUSTeK COMPUTER INC. X302LJ/X302LJ, BIOS X302LJ.202 03/05/2015 [ 681.774956] ffffffffa9a04863 0000000000000000 0000000000000000 ffffffffa990ff7c [ 681.774959] ffffc90000d3bc38 ffffffffa95d2810 0000000000000007 ffffffffa991002b [ 681.774960] ffffc90000d3bc68 ffffffffa91a44f4 0000000000000001 0000000000000001 [ 681.774962] Call Trace: [ 681.774967] [<ffffffffa95d2810>] dump_stack+0x4c/0x7f [ 681.774970] [<ffffffffa91a44f4>] report_size_overflow+0x34/0x50 [ 681.774972] [<ffffffffa94d17e2>] qdisc_tree_decrease_qlen+0x152/0x160 [ 681.774976] [<ffffffffc02694b1>] fq_codel_dequeue+0x7b1/0x820 [sch_fq_codel] [ 681.774978] [<ffffffffc02680a0>] ? qdisc_peek_dequeued+0xa0/0xa0 [sch_fq_codel] [ 681.774980] [<ffffffffa94cd92d>] __qdisc_run+0x4d/0x1d0 [ 681.774983] [<ffffffffa949b2b2>] net_tx_action+0xc2/0x160 [ 681.774985] [<ffffffffa90664c1>] __do_softirq+0xf1/0x200 [ 681.774987] [<ffffffffa90665ee>] run_ksoftirqd+0x1e/0x30 [ 681.774989] [<ffffffffa90896b0>] smpboot_thread_fn+0x150/0x260 [ 681.774991] [<ffffffffa9089560>] ? sort_range+0x40/0x40 [ 681.774992] [<ffffffffa9085fe4>] kthread+0xe4/0x100 [ 681.774994] [<ffffffffa9085f00>] ? kthread_worker_fn+0x170/0x170 [ 681.774995] [<ffffffffa95d8d1e>] ret_from_fork+0x3e/0x70 mq/mqprio have their own ways to report qlen/drops by folding stats on all their queues, with appropriate locking. A second problem is that qdisc_tree_decrease_qlen() calls qdisc_lookup() without proper locking : concurrent qdisc updates could corrupt the list that qdisc_match_from_root() parses to find a qdisc given its handle. Fix first problem adding a TCQ_F_NOPARENT qdisc flag that qdisc_tree_decrease_qlen() can use to abort its tree traversal, as soon as it meets a mq/mqprio qdisc children. Second problem can be fixed by RCU protection. Qdisc are already freed after RCU grace period, so qdisc_list_add() and qdisc_list_del() simply have to use appropriate rcu list variants. A future patch will add a per struct netdev_queue list anchor, so that qdisc_tree_decrease_qlen() can have more efficient lookups. Reported-by: Daniele Fucini <dfucini@gmail.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Cong Wang <cwang@twopensource.com> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-28net: sched: don't break line in tc_classify loop notificationDaniel Borkmann
Just some minor noise follow-up to address some stylistic issues of commit 3b3ae880266d ("net: sched: consolidate tc_classify{,_compat}"). Accidentally v1 instead of v2 of that commit got applied, so this patch adds the relative diff. Suggested-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-27net: sched: register noqueue qdiscPhil Sutter
This way users can attach noqueue just like any other qdisc using tc without having to mess with tx_queue_len first. Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-27net: sched: consolidate tc_classify{,_compat}Daniel Borkmann
For classifiers getting invoked via tc_classify(), we always need an extra function call into tc_classify_compat(), as both are being exported as symbols and tc_classify() itself doesn't do much except handling of reclassifications when tp->classify() returned with TC_ACT_RECLASSIFY. CBQ and ATM are the only qdiscs that directly call into tc_classify_compat(), all others use tc_classify(). When tc actions are being configured out in the kernel, tc_classify() effectively does nothing besides delegating. We could spare this layer and consolidate both functions. pktgen on single CPU constantly pushing skbs directly into the netif_receive_skb() path with a dummy classifier on ingress qdisc attached, improves slightly from 22.3Mpps to 23.1Mpps. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-24Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-nextLinus Torvalds
Pull networking updates from David Miller: 1) Add TX fast path in mac80211, from Johannes Berg. 2) Add TSO/GRO support to ibmveth, from Thomas Falcon 3) Move away from cached routes in ipv6, just like ipv4, from Martin KaFai Lau. 4) Lots of new rhashtable tests, from Thomas Graf. 5) Run ingress qdisc lockless, from Alexei Starovoitov. 6) Allow servers to fetch TCP packet headers for SYN packets of new connections, for fingerprinting. From Eric Dumazet. 7) Add mode parameter to pktgen, for testing receive. From Alexei Starovoitov. 8) Cache access optimizations via simplifications of build_skb(), from Alexander Duyck. 9) Move page frag allocator under mm/, also from Alexander. 10) Add xmit_more support to hv_netvsc, from KY Srinivasan. 11) Add a counter guard in case we try to perform endless reclassify loops in the packet scheduler. 12) Extern flow dissector to be programmable and use it in new "Flower" classifier. From Jiri Pirko. 13) AF_PACKET fanout rollover fixes, performance improvements, and new statistics. From Willem de Bruijn. 14) Add netdev driver for GENEVE tunnels, from John W Linville. 15) Add ingress netfilter hooks and filtering, from Pablo Neira Ayuso. 16) Fix handling of epoll edge triggers in TCP, from Eric Dumazet. 17) Add an ECN retry fallback for the initial TCP handshake, from Daniel Borkmann. 18) Add tail call support to BPF, from Alexei Starovoitov. 19) Add several pktgen helper scripts, from Jesper Dangaard Brouer. 20) Add zerocopy support to AF_UNIX, from Hannes Frederic Sowa. 21) Favor even port numbers for allocation to connect() requests, and odd port numbers for bind(0), in an effort to help avoid ip_local_port_range exhaustion. From Eric Dumazet. 22) Add Cavium ThunderX driver, from Sunil Goutham. 23) Allow bpf programs to access skb_iif and dev->ifindex SKB metadata, from Alexei Starovoitov. 24) Add support for T6 chips in cxgb4vf driver, from Hariprasad Shenai. 25) Double TCP Small Queues default to 256K to accomodate situations like the XEN driver and wireless aggregation. From Wei Liu. 26) Add more entropy inputs to flow dissector, from Tom Herbert. 27) Add CDG congestion control algorithm to TCP, from Kenneth Klette Jonassen. 28) Convert ipset over to RCU locking, from Jozsef Kadlecsik. 29) Track and act upon link status of ipv4 route nexthops, from Andy Gospodarek. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1670 commits) bridge: vlan: flush the dynamically learned entries on port vlan delete bridge: multicast: add a comment to br_port_state_selection about blocking state net: inet_diag: export IPV6_V6ONLY sockopt stmmac: troubleshoot unexpected bits in des0 & des1 net: ipv4 sysctl option to ignore routes when nexthop link is down net: track link-status of ipv4 nexthops net: switchdev: ignore unsupported bridge flags net: Cavium: Fix MAC address setting in shutdown state drivers: net: xgene: fix for ACPI support without ACPI ip: report the original address of ICMP messages net/mlx5e: Prefetch skb data on RX net/mlx5e: Pop cq outside mlx5e_get_cqe net/mlx5e: Remove mlx5e_cq.sqrq back-pointer net/mlx5e: Remove extra spaces net/mlx5e: Avoid TX CQE generation if more xmit packets expected net/mlx5e: Avoid redundant dev_kfree_skb() upon NOP completion net/mlx5e: Remove re-assignment of wq type in mlx5e_enable_rq() net/mlx5e: Use skb_shinfo(skb)->gso_segs rather than counting them net/mlx5e: Static mapping of netdev priv resources to/from netdev TX queues net/mlx4_en: Use HW counters for rx/tx bytes/packets in PF device ...
2015-06-22Merge branch 'timers-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer updates from Thomas Gleixner: "A rather largish update for everything time and timer related: - Cache footprint optimizations for both hrtimers and timer wheel - Lower the NOHZ impact on systems which have NOHZ or timer migration disabled at runtime. - Optimize run time overhead of hrtimer interrupt by making the clock offset updates smarter - hrtimer cleanups and removal of restrictions to tackle some problems in sched/perf - Some more leap second tweaks - Another round of changes addressing the 2038 problem - First step to change the internals of clock event devices by introducing the necessary infrastructure - Allow constant folding for usecs/msecs_to_jiffies() - The usual pile of clockevent/clocksource driver updates The hrtimer changes contain updates to sched, perf and x86 as they depend on them plus changes all over the tree to cleanup API changes and redundant code, which got copied all over the place. The y2038 changes touch s390 to remove the last non 2038 safe code related to boot/persistant clock" * 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (114 commits) clocksource: Increase dependencies of timer-stm32 to limit build wreckage timer: Minimize nohz off overhead timer: Reduce timer migration overhead if disabled timer: Stats: Simplify the flags handling timer: Replace timer base by a cpu index timer: Use hlist for the timer wheel hash buckets timer: Remove FIFO "guarantee" timers: Sanitize catchup_timer_jiffies() usage hrtimer: Allow hrtimer::function() to free the timer seqcount: Introduce raw_write_seqcount_barrier() seqcount: Rename write_seqcount_barrier() hrtimer: Fix hrtimer_is_queued() hole hrtimer: Remove HRTIMER_STATE_MIGRATE selftest: Timers: Avoid signal deadlock in leap-a-day timekeeping: Copy the shadow-timekeeper over the real timekeeper last clockevents: Check state instead of mode in suspend/resume path selftests: timers: Add leap-second timer edge testing to leap-a-day.c ntp: Do leapsecond adjustment in adjtimex read path time: Prevent early expiry of hrtimers[CLOCK_REALTIME] at the leap second edge ntp: Introduce and use SECS_PER_DAY macro instead of 86400 ...
2015-06-01Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflicts: drivers/net/phy/amd-xgbe-phy.c drivers/net/wireless/iwlwifi/Kconfig include/net/mac80211.h iwlwifi/Kconfig and mac80211.h were both trivial overlapping changes. The drivers/net/phy/amd-xgbe-phy.c file got removed in 'net-next' and the bug fix that happened on the 'net' side is already integrated into the rest of the amd-xgbe driver. Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-27net_sched: invoke ->attach() after setting dev->qdiscWANG Cong
For mq qdisc, we add per tx queue qdisc to root qdisc for display purpose, however, that happens too early, before the new dev->qdisc is finally set, this causes q->list points to an old root qdisc which is going to be freed right before assigning with a new one. Fix this by moving ->attach() after setting dev->qdisc. For the record, this fixes the following crash: ------------[ cut here ]------------ WARNING: CPU: 1 PID: 975 at lib/list_debug.c:59 __list_del_entry+0x5a/0x98() list_del corruption. prev->next should be ffff8800d1998ae8, but was 6b6b6b6b6b6b6b6b CPU: 1 PID: 975 Comm: tc Not tainted 4.1.0-rc4+ #1019 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 0000000000000009 ffff8800d73fb928 ffffffff81a44e7f 0000000047574756 ffff8800d73fb978 ffff8800d73fb968 ffffffff810790da ffff8800cfc4cd20 ffffffff814e725b ffff8800d1998ae8 ffffffff82381250 0000000000000000 Call Trace: [<ffffffff81a44e7f>] dump_stack+0x4c/0x65 [<ffffffff810790da>] warn_slowpath_common+0x9c/0xb6 [<ffffffff814e725b>] ? __list_del_entry+0x5a/0x98 [<ffffffff81079162>] warn_slowpath_fmt+0x46/0x48 [<ffffffff81820eb0>] ? dev_graft_qdisc+0x5e/0x6a [<ffffffff814e725b>] __list_del_entry+0x5a/0x98 [<ffffffff814e72a7>] list_del+0xe/0x2d [<ffffffff81822f05>] qdisc_list_del+0x1e/0x20 [<ffffffff81820cd1>] qdisc_destroy+0x30/0xd6 [<ffffffff81822676>] qdisc_graft+0x11d/0x243 [<ffffffff818233c1>] tc_get_qdisc+0x1a6/0x1d4 [<ffffffff810b5eaf>] ? mark_lock+0x2e/0x226 [<ffffffff817ff8f5>] rtnetlink_rcv_msg+0x181/0x194 [<ffffffff817ff72e>] ? rtnl_lock+0x17/0x19 [<ffffffff817ff72e>] ? rtnl_lock+0x17/0x19 [<ffffffff817ff774>] ? __rtnl_unlock+0x17/0x17 [<ffffffff81855dc6>] netlink_rcv_skb+0x4d/0x93 [<ffffffff817ff756>] rtnetlink_rcv+0x26/0x2d [<ffffffff818544b2>] netlink_unicast+0xcb/0x150 [<ffffffff81161db9>] ? might_fault+0x59/0xa9 [<ffffffff81854f78>] netlink_sendmsg+0x4fa/0x51c [<ffffffff817d6e09>] sock_sendmsg_nosec+0x12/0x1d [<ffffffff817d8967>] sock_sendmsg+0x29/0x2e [<ffffffff817d8cf3>] ___sys_sendmsg+0x1b4/0x23a [<ffffffff8100a1b8>] ? native_sched_clock+0x35/0x37 [<ffffffff810a1d83>] ? sched_clock_local+0x12/0x72 [<ffffffff810a1fd4>] ? sched_clock_cpu+0x9e/0xb7 [<ffffffff810def2a>] ? current_kernel_time+0xe/0x32 [<ffffffff810b4bc5>] ? lock_release_holdtime.part.29+0x71/0x7f [<ffffffff810ddebf>] ? read_seqcount_begin.constprop.27+0x5f/0x76 [<ffffffff810b6292>] ? trace_hardirqs_on_caller+0x17d/0x199 [<ffffffff811b14d5>] ? __fget_light+0x50/0x78 [<ffffffff817d9808>] __sys_sendmsg+0x42/0x60 [<ffffffff817d9838>] SyS_sendmsg+0x12/0x1c [<ffffffff81a50e97>] system_call_fastpath+0x12/0x6f ---[ end trace ef29d3fb28e97ae7 ]--- For long term, we probably need to clean up the qdisc_graft() code in case it hides other bugs like this. Fixes: 95dc19299f74 ("pkt_sched: give visibility to mq slave qdiscs") Cc: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-13net: sched: use counter to break reclassify loopsFlorian Westphal
Seems all we want here is to avoid endless 'goto reclassify' loop. tc_classify_compat even resets this counter when something other than TC_ACT_RECLASSIFY is returned, so this skb-counter doesn't break hypothetical loops induced by something other than perpetual TC_ACT_RECLASSIFY return values. skb_act_clone is now identical to skb_clone, so just use that. Tested with following (bogus) filter: tc filter add dev eth0 parent ffff: \ protocol ip u32 match u32 0 0 police rate 10Kbit burst \ 64000 mtu 1500 action reclassify Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-22net: sched: Use hrtimer_resolution instead of hrtimer_get_res()Thomas Gleixner
No point in converting a timespec now that the value is directly accessible. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: David S. Miller <davem@davemloft.net> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com> Cc: Viresh Kumar <viresh.kumar@linaro.org> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Link: http://lkml.kernel.org/r/20150414203500.720623028@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-03-09net_sched: destroy proto tp when all filters are goneCong Wang
Kernel automatically creates a tp for each (kind, protocol, priority) tuple, which has handle 0, when we add a new filter, but it still is left there after we remove our own, unless we don't specify the handle (literally means all the filters under the tuple). For example this one is left: # tc filter show dev eth0 filter parent 8001: protocol arp pref 49152 basic The user-space is hard to clean up these for kernel because filters like u32 are organized in a complex way. So kernel is responsible to remove it after all filters are gone. Each type of filter has its own way to store the filters, so each type has to provide its way to check if all filters are gone. Cc: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Cong Wang <cwang@twopensource.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Jamal Hadi Salim<jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-13net: sched: fix skb->protocol use in case of accelerated vlan pathJiri Pirko
tc code implicitly considers skb->protocol even in case of accelerated vlan paths and expects vlan protocol type here. However, on rx path, if the vlan header was already stripped, skb->protocol contains value of next header. Similar situation is on tx path. So for skbs that use skb->vlan_tci for tagging, use skb->vlan_proto instead. Reported-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Jiri Pirko <jiri@resnulli.us> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-21net: sched: initialize bstats syncpSabrina Dubroca
Use netdev_alloc_pcpu_stats to allocate percpu stats and initialize syncp. Fixes: 22e0f8b9322c "net: sched: make bstats per cpu and estimator RCU safe" Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Acked-by: Cong Wang <cwang@twopensource.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-06net: sched: avoid costly atomic operation in fq_dequeue()Eric Dumazet
Standard qdisc API to setup a timer implies an atomic operation on every packet dequeue : qdisc_unthrottled() It turns out this is not really needed for FQ, as FQ has no concept of global qdisc throttling, being a qdisc handling many different flows, some of them can be throttled, while others are not. Fix is straightforward : add a 'bool throttle' to qdisc_watchdog_schedule_ns(), and remove calls to qdisc_unthrottled() in sch_fq. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-04net: sched: suspicious RCU usage in qdisc_watchdogJohn Fastabend
Suspicious RCU usage in qdisc_watchdog call needs to be done inside rcu_read_lock/rcu_read_unlock. And then Qdisc destroy operations need to ensure timer is cancelled before removing qdisc structure. [ 3992.191339] =============================== [ 3992.191340] [ INFO: suspicious RCU usage. ] [ 3992.191343] 3.17.0-rc6net-next+ #72 Not tainted [ 3992.191345] ------------------------------- [ 3992.191347] include/net/sch_generic.h:272 suspicious rcu_dereference_check() usage! [ 3992.191348] [ 3992.191348] other info that might help us debug this: [ 3992.191348] [ 3992.191351] [ 3992.191351] rcu_scheduler_active = 1, debug_locks = 1 [ 3992.191353] no locks held by swapper/1/0. [ 3992.191355] [ 3992.191355] stack backtrace: [ 3992.191358] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 3.17.0-rc6net-next+ #72 [ 3992.191360] Hardware name: /DZ77RE-75K, BIOS GAZ7711H.86A.0060.2012.1115.1750 11/15/2012 [ 3992.191362] 0000000000000001 ffff880235803e48 ffffffff8178f92c 0000000000000000 [ 3992.191366] ffff8802322224a0 ffff880235803e78 ffffffff810c9966 ffff8800a5fe3000 [ 3992.191370] ffff880235803f30 ffff8802359cd768 ffff8802359cd6e0 ffff880235803e98 [ 3992.191374] Call Trace: [ 3992.191376] <IRQ> [<ffffffff8178f92c>] dump_stack+0x4e/0x68 [ 3992.191387] [<ffffffff810c9966>] lockdep_rcu_suspicious+0xe6/0x130 [ 3992.191392] [<ffffffff8167213a>] qdisc_watchdog+0x8a/0xb0 [ 3992.191396] [<ffffffff810f93f2>] __run_hrtimer+0x72/0x420 [ 3992.191399] [<ffffffff810f9bcd>] ? hrtimer_interrupt+0x7d/0x240 [ 3992.191403] [<ffffffff816720b0>] ? tc_classify+0xc0/0xc0 [ 3992.191406] [<ffffffff810f9c4f>] hrtimer_interrupt+0xff/0x240 [ 3992.191410] [<ffffffff8109e4a5>] ? __atomic_notifier_call_chain+0x5/0x140 [ 3992.191415] [<ffffffff8103577b>] local_apic_timer_interrupt+0x3b/0x60 [ 3992.191419] [<ffffffff8179c2b5>] smp_apic_timer_interrupt+0x45/0x60 [ 3992.191422] [<ffffffff8179a6bf>] apic_timer_interrupt+0x6f/0x80 [ 3992.191424] <EOI> [<ffffffff815ed233>] ? cpuidle_enter_state+0x73/0x2e0 [ 3992.191432] [<ffffffff815ed22e>] ? cpuidle_enter_state+0x6e/0x2e0 [ 3992.191437] [<ffffffff815ed567>] cpuidle_enter+0x17/0x20 [ 3992.191441] [<ffffffff810c0741>] cpu_startup_entry+0x3d1/0x4a0 [ 3992.191445] [<ffffffff81106fc6>] ? clockevents_config_and_register+0x26/0x30 [ 3992.191448] [<ffffffff81033c16>] start_secondary+0x1b6/0x260 Fixes: b26b0d1e8b1 ("net: qdisc: use rcu prefix and silence sparse warnings") Signed-off-by: John Fastabend <john.r.fastabend@intel.com> Acked-by: Cong Wang <cwang@twopensource.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-30net: sched: enable per cpu qstatsJohn Fastabend
After previous patches to simplify qstats the qstats can be made per cpu with a packed union in Qdisc struct. Signed-off-by: John Fastabend <john.r.fastabend@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-30net: sched: restrict use of qstats qlenJohn Fastabend
This removes the use of qstats->qlen variable from the classifiers and makes it an explicit argument to gnet_stats_copy_queue(). The qlen represents the qdisc queue length and is packed into the qstats at the last moment before passnig to user space. By handling it explicitely we avoid, in the percpu stats case, having to figure out which per_cpu variable to put it in. It would probably be best to remove it from qstats completely but qstats is a user space ABI and can't be broken. A future patch could make an internal only qstats structure that would avoid having to allocate an additional u32 variable on the Qdisc struct. This would make the qstats struct 128bits instead of 128+32. Signed-off-by: John Fastabend <john.r.fastabend@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-30net: sched: implement qstat helper routinesJohn Fastabend
This adds helpers to manipulate qstats logic and replaces locations that touch the counters directly. This simplifies future patches to push qstats onto per cpu counters. Signed-off-by: John Fastabend <john.r.fastabend@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-30net: sched: make bstats per cpu and estimator RCU safeJohn Fastabend
In order to run qdisc's without locking statistics and estimators need to be handled correctly. To resolve bstats make the statistics per cpu. And because this is only needed for qdiscs that are running without locks which is not the case for most qdiscs in the near future only create percpu stats when qdiscs set the TCQ_F_CPUSTATS flag. Next because estimators use the bstats to calculate packets per second and bytes per second the estimator code paths are updated to use the per cpu statistics. Signed-off-by: John Fastabend <john.r.fastabend@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-26net: sched: use pinned timersEric Dumazet
While using a MQ + NETEM setup, I had confirmation that the default timer migration ( /proc/sys/kernel/timer_migration ) is killing us. Installing this on a receiver side of a TCP_STREAM test, (NIC has 8 TX queues) : EST="est 1sec 4sec" for ETH in eth1 do tc qd del dev $ETH root 2>/dev/null tc qd add dev $ETH root handle 1: mq tc qd add dev $ETH parent 1:1 $EST netem limit 70000 delay 6ms tc qd add dev $ETH parent 1:2 $EST netem limit 70000 delay 8ms tc qd add dev $ETH parent 1:3 $EST netem limit 70000 delay 10ms tc qd add dev $ETH parent 1:4 $EST netem limit 70000 delay 12ms tc qd add dev $ETH parent 1:5 $EST netem limit 70000 delay 14ms tc qd add dev $ETH parent 1:6 $EST netem limit 70000 delay 16ms tc qd add dev $ETH parent 1:7 $EST netem limit 80000 delay 18ms tc qd add dev $ETH parent 1:8 $EST netem limit 90000 delay 20ms done We can see that timers get migrated into a single cpu, presumably idle at the time timers are set up. Then all qdisc dequeues run from this cpu and huge lock contention happens. This single cpu is stuck in softirq mode and cannot dequeue fast enough. 39.24% [kernel] [k] _raw_spin_lock 2.65% [kernel] [k] netem_enqueue 1.80% [kernel] [k] netem_dequeue 1.63% [kernel] [k] copy_user_enhanced_fast_string 1.45% [kernel] [k] _raw_spin_lock_bh By pinning qdisc timers on the cpu running the qdisc, we respect proper XPS setting and remove this lock contention. 5.84% [kernel] [k] netem_enqueue 4.83% [kernel] [k] _raw_spin_lock 2.92% [kernel] [k] copy_user_enhanced_fast_string Current Qdiscs that benefit from this change are : netem, cbq, fq, hfsc, tbf, htb. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-13net: rcu-ify tcf_protoJohn Fastabend
rcu'ify tcf_proto this allows calling tc_classify() without holding any locks. Updaters are protected by RTNL. This patch prepares the core net_sched infrastracture for running the classifier/action chains without holding the qdisc lock however it does nothing to ensure cls_xxx and act_xxx types also work without locking. Additional patches are required to address the fall out. Signed-off-by: John Fastabend <john.r.fastabend@intel.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-11net_sched: drr: warn when qdisc is not work conservingFlorian Westphal
The DRR scheduler requires that items on the active list are work conserving, i.e. do not hold on to skbs for throttling purposes, etc. Attaching e.g. tbf renders DRR useless because all other classes on the active list are delayed as well. So, warn users that this configuration won't work as expected; we already do this in couple of other qdiscs, see e.g. commit b00355db3f88d96810a60011a30cfb2c3469409d ('pkt_sched: sch_hfsc: sch_htb: Add non-work-conserving warning handler') The 'const' change is needed to avoid compiler warning ("discards 'const' qualifier from pointer target type"). tested with: drr_hier() { parent=$1 classes=$2 for i in $(seq 1 $classes); do classid=$parent$(printf %x $i) tc class add dev eth0 parent $parent classid $classid drr tc qdisc add dev eth0 parent $classid tbf rate 64kbit burst 256kbit limit 64kbit done } tc qdisc add dev eth0 root handle 1: drr drr_hier 1: 32 tc filter add dev eth0 protocol all pref 1 parent 1: handle 1 flow hash keys dst perturb 1 divisor 32 Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-12Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflicts: drivers/net/ethernet/altera/altera_sgdma.c net/netlink/af_netlink.c net/sched/cls_api.c net/sched/sch_api.c The netlink conflict dealt with moving to netlink_capable() and netlink_ns_capable() in the 'net' tree vs. supporting 'tc' operations in non-init namespaces. These were simple transformations from netlink_capable to netlink_ns_capable. The Altera driver conflict was simply code removal overlapping some void pointer cast cleanups in net-next. Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-02net: Allow tc changes in user namespacesStéphane Graber
This switches a few remaining capable(CAP_NET_ADMIN) to ns_capable so that root in a user namespace may set tc rules inside that namespace. Signed-off-by: Stéphane Graber <stgraber@ubuntu.com> Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: "David S. Miller" <davem@davemloft.net> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-04-24net: Use netlink_ns_capable to verify the permisions of netlink messagesEric W. Biederman
It is possible by passing a netlink socket to a more privileged executable and then to fool that executable into writing to the socket data that happens to be valid netlink message to do something that privileged executable did not intend to do. To keep this from happening replace bare capable and ns_capable calls with netlink_capable, netlink_net_calls and netlink_ns_capable calls. Which act the same as the previous calls except they verify that the opener of the socket had the desired permissions as well. Reported-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-14Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflicts: drivers/net/usb/r8152.c drivers/net/xen-netback/netback.c Both the r8152 and netback conflicts were simple overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-11pkt_sched: add cond_resched() to class and qdisc dumpEric Dumazet
We have seen delays of more than 50ms in class or qdisc dumps, in case device is under high TX stress, even with the prior 4KB per skb limit. Add cond_resched() to give a chance to higher prio tasks to get cpu. Signed-off-by; Eric Dumazet <edumazet@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-11pkt_sched: do not use rcu in tc_dump_qdisc()Eric Dumazet
Like all rtnetlink dump operations, we hold RTNL in tc_dump_qdisc(), so we do not need to use rcu protection to protect list of netdevices. This will allow preemption to occur, thus reducing latencies. Following patch adds explicit cond_resched() calls. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-10pkt_sched: move the sanity test in qdisc_list_add()Eric Dumazet
The WARN_ON(root == &noop_qdisc)) added in qdisc_list_add() can trigger in normal conditions when devices are not up. It should be done only right before the list_add_tail() call. Fixes: e57a784d8cae4 ("pkt_sched: set root qdisc before change() in attach_default_qdiscs()") Reported-by: Valdis Kletnieks <Valdis.Kletnieks@vt.edu> Tested-by: Mirco Tischler <mt-ml@gmx.de> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-31net, sch: fix the typo in register_qdisc()Zhi Yong Wu
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-14pkt_sched: set root qdisc before change() in attach_default_qdiscs()Eric Dumazet
After commit 95dc19299f74 ("pkt_sched: give visibility to mq slave qdiscs") we call disc_list_add() while the device qdisc might be the noop_qdisc one. This shows up as duplicates in "tc qdisc show", as all inactive devices point to noop_qdisc. Fix this by setting dev->qdisc to the new qdisc before calling ops->change() in attach_default_qdiscs() Add a WARN_ON_ONCE() to catch any future similar problem. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-09pkt_sched: give visibility to mq slave qdiscsEric Dumazet
Commit 6da7c8fcbcbd ("qdisc: allow setting default queuing discipline") added the ability to change default qdisc from pfifo_fast to say fq But as most modern ethernet devices are multiqueue, we cant really see all the statistics from "tc -s qdisc show", as the default root qdisc is mq. This patch adds the calls to qdisc_list_add() to mq and mqprio Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-08net_sched: increment drop counters in qdisc_tree_decrease_qlen()Eric Dumazet
qdisc_tree_decrease_qlen() is called when some packets are dropped on a qdisc, and we want to notify parents of qlen changes. We also can increment parents qdisc qstats drop counters. This permits more accurate drop counters up to root qdisc. For example a graft operation typically resets a qdisc (drops all packets) and call qdisc_tree_decrease_qlen() Note that callers are responsible for their drop counters. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-31qdisc: fix build with !CONFIG_NET_SCHEDstephen hemminger
Multiqueue scheduler refers to default_qdisc_ops; therefore the variable definition needs to be moved to handle case where net scheduler API is not available. Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-31qdisc: allow setting default queuing disciplinestephen hemminger
By default, the pfifo_fast queue discipline has been used by default for all devices. But we have better choices now. This patch allow setting the default queueing discipline with sysctl. This allows easy use of better queueing disciplines on all devices without having to use tc qdisc scripts. It is intended to allow an easy path for distributions to make fq_codel or sfq the default qdisc. This patch also makes pfifo_fast more of a first class qdisc, since it is now possible to manually override the default and explicitly use pfifo_fast. The behavior for systems who do not use the sysctl is unchanged, they still get pfifo_fast Also removes leftover random # in sysctl net core. Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-15net_sched: restore "linklayer atm" handlingJesper Dangaard Brouer
commit 56b765b79 ("htb: improved accuracy at high rates") broke the "linklayer atm" handling. tc class add ... htb rate X ceil Y linklayer atm The linklayer setting is implemented by modifying the rate table which is send to the kernel. No direct parameter were transferred to the kernel indicating the linklayer setting. The commit 56b765b79 ("htb: improved accuracy at high rates") removed the use of the rate table system. To keep compatible with older iproute2 utils, this patch detects the linklayer by parsing the rate table. It also supports future versions of iproute2 to send this linklayer parameter to the kernel directly. This is done by using the __reserved field in struct tc_ratespec, to convey the choosen linklayer option, but only using the lower 4 bits of this field. Linklayer detection is limited to speeds below 100Mbit/s, because at high rates the rtab is gets too inaccurate, so bad that several fields contain the same values, this resembling the ATM detect. Fields even start to contain "0" time to send, e.g. at 1000Mbit/s sending a 96 bytes packet cost "0", thus the rtab have been more broken than we first realized. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-07net_sched: qdisc_get_rtab() must check data[] arrayEric Dumazet
qdisc_get_rtab() should check not only the keys in struct tc_ratespec, but also the full data[] array. "tc ... linklayer atm " only perturbs values in the 256 slots array. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-28net-next: replace obsolete NLMSG_* with type safe nlmsg_*Hong zhi guo
Signed-off-by: Hong Zhiguo <honkiko@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-26netlink: have length check of rtnl msg before derefHong zhi guo
When the legacy array rtm_min still exists, the length check within these functions is covered by rtm_min[RTM_NEWTFILTER], rtm_min[RTM_NEWQDISC] and rtm_min[RTM_NEWTCLASS]. But after Thomas Graf removed rtm_min several days ago, these checks are missing. Other doit functions should be OK. Signed-off-by: Hong Zhiguo <honkiko@gmail.com> Acked-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-22rtnetlink: Remove passing of attributes into rtnl_doit functionsThomas Graf
With decnet converted, we can finally get rid of rta_buf and its computations around it. It also gets rid of the minimal header length verification since all message handlers do that explicitly anyway. Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-02-27hlist: drop the node parameter from iteratorsSasha Levin
I'm not sure why, but the hlist for each entry iterators were conceived list_for_each_entry(pos, head, member) The hlist ones were greedy and wanted an extra parameter: hlist_for_each_entry(tpos, pos, head, member) Why did they need an extra pos parameter? I'm not quite sure. Not only they don't really need it, it also prevents the iterator from looking exactly like the list iterator, which is unfortunate. Besides the semantic patch, there was some manual work required: - Fix up the actual hlist iterators in linux/list.h - Fix up the declaration of other iterators based on the hlist ones. - A very small amount of places were using the 'node' parameter, this was modified to use 'obj->member' instead. - Coccinelle didn't handle the hlist_for_each_entry_safe iterator properly, so those had to be fixed up manually. The semantic patch which is mostly the work of Peter Senna Tschudin is here: @@ iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host; type T; expression a,c,d,e; identifier b; statement S; @@ -T b; <+... when != b ( hlist_for_each_entry(a, - b, c, d) S | hlist_for_each_entry_continue(a, - b, c) S | hlist_for_each_entry_from(a, - b, c) S | hlist_for_each_entry_rcu(a, - b, c, d) S | hlist_for_each_entry_rcu_bh(a, - b, c, d) S | hlist_for_each_entry_continue_rcu_bh(a, - b, c) S | for_each_busy_worker(a, c, - b, d) S | ax25_uid_for_each(a, - b, c) S | ax25_for_each(a, - b, c) S | inet_bind_bucket_for_each(a, - b, c) S | sctp_for_each_hentry(a, - b, c) S | sk_for_each(a, - b, c) S | sk_for_each_rcu(a, - b, c) S | sk_for_each_from -(a, b) +(a) S + sk_for_each_from(a) S | sk_for_each_safe(a, - b, c, d) S | sk_for_each_bound(a, - b, c) S | hlist_for_each_entry_safe(a, - b, c, d, e) S | hlist_for_each_entry_continue_rcu(a, - b, c) S | nr_neigh_for_each(a, - b, c) S | nr_neigh_for_each_safe(a, - b, c, d) S | nr_node_for_each(a, - b, c) S | nr_node_for_each_safe(a, - b, c, d) S | - for_each_gfn_sp(a, c, d, b) S + for_each_gfn_sp(a, c, d) S | - for_each_gfn_indirect_valid_sp(a, c, d, b) S + for_each_gfn_indirect_valid_sp(a, c, d) S | for_each_host(a, - b, c) S | for_each_host_safe(a, - b, c, d) S | for_each_mesh_entry(a, - b, c, d) S ) ...+> [akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c] [akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c] [akpm@linux-foundation.org: checkpatch fixes] [akpm@linux-foundation.org: fix warnings] [akpm@linux-foudnation.org: redo intrusive kvm changes] Tested-by: Peter Senna Tschudin <peter.senna@gmail.com> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Gleb Natapov <gleb@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-18net: proc: change proc_net_remove to remove_proc_entryGao feng
proc_net_remove is only used to remove proc entries that under /proc/net,it's not a general function for removing proc entries of netns. if we want to remove some proc entries which under /proc/net/stat/, we still need to call remove_proc_entry. this patch use remove_proc_entry to replace proc_net_remove. we can remove proc_net_remove after this patch. Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-02-18net: proc: change proc_net_fops_create to proc_createGao feng
Right now, some modules such as bonding use proc_create to create proc entries under /proc/net/, and other modules such as ipv4 use proc_net_fops_create. It looks a little chaos.this patch changes all of proc_net_fops_create to proc_create. we can remove proc_net_fops_create after this patch. Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>