path: root/tools/perf/bench
Age | Commit message | Author
2020-11-12 | tools arch: Update arch/x86/lib/mem{cpy,set}_64.S copies used in 'perf bench mem memcpy' | Arnaldo Carvalho de Melo

To bring in the change made in this cset:

  4d6ffa27b8e5116c ("x86/lib: Change .weak to SYM_FUNC_START_WEAK for arch/x86/lib/mem*_64.S")
  6dcc5627f6aec4cb ("x86/asm: Change all ENTRY+ENDPROC to SYM_FUNC_*")

I needed to define SYM_FUNC_START_LOCAL() as SYM_L_GLOBAL as mem{cpy,set}_{orig,erms} are used by 'perf bench'.

This silences these perf tools build warnings:

  Warning: Kernel ABI header at 'tools/arch/x86/lib/memcpy_64.S' differs from latest version at 'arch/x86/lib/memcpy_64.S'
  diff -u tools/arch/x86/lib/memcpy_64.S arch/x86/lib/memcpy_64.S
  Warning: Kernel ABI header at 'tools/arch/x86/lib/memset_64.S' differs from latest version at 'arch/x86/lib/memset_64.S'
  diff -u tools/arch/x86/lib/memset_64.S arch/x86/lib/memset_64.S

Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Fangrui Song <maskray@google.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Jiri Slaby <jirislaby@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
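The warnings quoted in this commit come from a build-time comparison between the copies kept under tools/ and the kernel originals. The following is a minimal Python sketch of that kind of drift check. The helper names (`find_stale_copies`, `demo`) and the file contents are hypothetical placeholders chosen for illustration; this is not the script the perf build actually runs.

```python
import filecmp
import os
import tempfile


def find_stale_copies(root, pairs):
    """Return one warning string per (copy, original) pair whose contents differ.

    `root` is the top of a source tree and `pairs` holds paths relative to it.
    Hypothetical helper, for illustration only.
    """
    warnings = []
    for copy_rel, orig_rel in pairs:
        copy_path = os.path.join(root, copy_rel)
        orig_path = os.path.join(root, orig_rel)
        # shallow=False forces a content comparison instead of trusting stat().
        if not filecmp.cmp(copy_path, orig_path, shallow=False):
            warnings.append(
                f"Warning: Kernel ABI header at '{copy_rel}' differs "
                f"from latest version at '{orig_rel}'"
            )
    return warnings


def _write(root, rel, text):
    path = os.path.join(root, rel)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as handle:
        handle.write(text)


def demo():
    """Build a throwaway tree with one stale pair and one in-sync pair."""
    with tempfile.TemporaryDirectory() as root:
        # Placeholder contents, not the real assembly sources.
        _write(root, "arch/x86/lib/memcpy_64.S", "new annotation\n")
        _write(root, "tools/arch/x86/lib/memcpy_64.S", "old annotation\n")
        _write(root, "arch/x86/lib/memset_64.S", "unchanged\n")
        _write(root, "tools/arch/x86/lib/memset_64.S", "unchanged\n")
        pairs = [
            ("tools/arch/x86/lib/memcpy_64.S", "arch/x86/lib/memcpy_64.S"),
            ("tools/arch/x86/lib/memset_64.S", "arch/x86/lib/memset_64.S"),
        ]
        return find_stale_copies(root, pairs)


if __name__ == "__main__":
    for line in demo():
        print(line)
```

Only the pair whose contents differ produces a warning; once the copy is refreshed, as this commit does, the check goes quiet.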
2020-10-17 | Merge tag 'perf-tools-for-v5.10-2020-10-15' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux | Linus Torvalds

Pull perf tools updates from Arnaldo Carvalho de Melo:

 - cgroup improvements for 'perf stat', allowing for compact specification of events and cgroups in the command line.
 - Support per thread topdown metrics in 'perf stat'.
 - Support sample-read topdown metric group in 'perf record'
 - Show start of latency in addition to its start in 'perf sched latency'.
 - Add min, max to 'perf script' futex-contention output, in addition to avg.
 - Allow usage of 'perf_event_attr->exclusive' attribute via the new ':e' event modifier.
 - Add 'snapshot' command to 'perf record --control', using it with Intel PT.
 - Support FIFO file names as alternative options to 'perf record --control'.
 - Introduce branch history "streams", to compare 'perf record' runs with 'perf diff' based on branch records and report hot streams.
 - Support PE executable symbol tables using libbfd, to profile, for instance, wine binaries.
 - Add filter support for option 'perf ftrace -F/--funcs'.
 - Allow configuring the 'disassembler_style' 'perf annotate' knob via 'perf config'
 - Update CascadelakeX and SkylakeX JSON vendor events files.
 - Add support for parsing perchip/percore JSON vendor events.
 - Add power9 hv_24x7 core level metric events.
 - Add L2 prefetch, ITLB instruction fetch hits JSON events for AMD zen1.
 - Enable Family 19h users by matching Zen2 AMD vendor events.
 - Use debuginfod in 'perf probe' when required debug files not found locally.
 - Display negative tid in non-sample events in 'perf script'.
 - Make GTK2 support opt-in
 - Add build test with GTK+
 - Add missing -lzstd to the fast path feature detection
 - Add scripts to auto generate 'mmap', 'mremap' string<->id tables for use in 'perf trace'.
 - Show python test script in verbose mode.
 - Fix uncore metric expressions
 - Msan uninitialized use fixes.
 - Use condition variables in 'perf bench numa'
 - Autodetect python3 binary in systems without python2.
 - Support md5 build ids in addition to sha1.
 - Add build id 'perf test' regression test.
 - Fix printable strings in python3 scripts.
 - Fix off by ones in 'perf trace' in arches using libaudit.
 - Fix JSON event code for events referencing std arch events.
 - Introduce 'perf test' shell script for Arm CoreSight testing.
 - Add rdtsc() for Arm64 for use in the PERF_RECORD_TIME_CONV metadata event and in 'perf test tsc'.
 - 'perf c2c' improvements: Add "RMT Load Hit" metric, "Total Stores", fixes and documentation update.
 - Fix usage of reloc_sym in 'perf probe' when using both kallsyms and debuginfo files.
 - Do not print 'Metric Groups:' unnecessarily in 'perf list'
 - Refcounting fixes in the event parsing code.
 - Add expand cgroup event 'perf test' entry.
 - Fix out of bounds CPU map access when handling armv8_pmu events in 'perf stat'.
 - Add build-id injection 'perf bench' benchmark.
 - Enter namespace when reading build-id in 'perf inject'.
 - Do not load map/dso when injecting build-id speeding up the 'perf inject' process.
 - Add --buildid-all option to avoid processing all samples, just the mmap metadata events.
 - Add feature test to check if libbfd has buildid support
 - Add 'perf test' entry for PE binary format support.
 - Fix typos in power8 PMU vendor events JSON files.
 - Hide libtraceevent non API functions.

* tag 'perf-tools-for-v5.10-2020-10-15' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux: (113 commits)
  perf c2c: Update documentation for metrics reorganization
  perf c2c: Add metrics "RMT Load Hit"
  perf c2c: Correct LLC load hit metrics
  perf c2c: Change header for LLC local hit
  perf c2c: Use more explicit headers for HITM
  perf c2c: Change header from "LLC Load Hitm" to "Load Hitm"
  perf c2c: Organize metrics based on memory hierarchy
  perf c2c: Display "Total Stores" as a standalone metrics
  perf c2c: Display the total numbers continuously
  perf bench: Use condition variables in numa.
  perf jevents: Fix event code for events referencing std arch events
  perf diff: Support hot streams comparison
  perf streams: Report hot streams
  perf streams: Calculate the sum of total streams hits
  perf streams: Link stream pair
  perf streams: Compare two streams
  perf streams: Get the evsel_streams by evsel_idx
  perf streams: Introduce branch history "streams"
  perf intel-pt: Improve PT documentation slightly
  perf tools: Add support for exclusive groups/events
  ...
2020-10-14 | perf bench: Use condition variables in numa. | Ian Rogers

The existing approach to synchronization between threads in the numa benchmark is unbalanced mutexes. This synchronization causes thread sanitizer to warn of locks being taken twice on a thread without an unlock, as well as unlocks with no corresponding locks.

This change replaces the synchronization with more regular condition variables. While this fixes one class of thread sanitizer warnings, there still remain warnings of data races due to threads reading and writing shared memory without any atomics.

Committer testing:

Basic run on a non-NUMA machine.

  # perf bench numa

    # List of available benchmarks for collection 'numa':

       mem: Benchmark for NUMA workloads
       all: Run all NUMA benchmarks

  # perf bench numa all
  # Running numa/mem benchmark...

   # Running main, "perf bench numa numa-mem"

   #
   # Running test on: Linux five 5.8.12-200.fc32.x86_64 #1 SMP Mon Sep 28 12:17:31 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
   #

   # Running RAM-bw-local, "perf bench numa mem -p 1 -t 1 -P 1024 -C 0 -M 0 -s 20 -zZq --thp 1 --no-data_rand_walk"
         20.076 secs slowest (max) thread-runtime
         20.000 secs fastest (min) thread-runtime
         20.073 secs average thread-runtime
          0.190 % difference between max/avg runtime
        241.828 GB data processed, per thread
        241.828 GB data processed, total
          0.083 nsecs/byte/thread runtime
         12.045 GB/sec/thread speed
         12.045 GB/sec total speed

   # Running RAM-bw-local-NOTHP, "perf bench numa mem -p 1 -t 1 -P 1024 -C 0 -M 0 -s 20 -zZq --thp 1 --no-data_rand_walk --thp -1"
         20.045 secs slowest (max) thread-runtime
         20.000 secs fastest (min) thread-runtime
         20.014 secs average thread-runtime
          0.111 % difference between max/avg runtime
        234.304 GB data processed, per thread
        234.304 GB data processed, total
          0.086 nsecs/byte/thread runtime
         11.689 GB/sec/thread speed
         11.689 GB/sec total speed

   # Running RAM-bw-remote, "perf bench numa mem -p 1 -t 1 -P 1024 -C 0 -M 1 -s 20 -zZq --thp 1 --no-data_rand_walk"
   Test not applicable, system has only 1 nodes.
# Running RAM-bw-local-2x, "perf bench numa mem -p 2 -t 1 -P 1024 -C 0,2 -M 0x2 -s 20 -zZq --thp 1 --no-data_rand_walk" 20.138 secs slowest (max) thread-runtime 20.000 secs fastest (min) thread-runtime 20.121 secs average thread-runtime 0.342 % difference between max/avg runtime 135.961 GB data processed, per thread 271.922 GB data processed, total 0.148 nsecs/byte/thread runtime 6.752 GB/sec/thread speed 13.503 GB/sec total speed # Running RAM-bw-remote-2x, "perf bench numa mem -p 2 -t 1 -P 1024 -C 0,2 -M 1x2 -s 20 -zZq --thp 1 --no-data_rand_walk" Test not applicable, system has only 1 nodes. # Running RAM-bw-cross, "perf bench numa mem -p 2 -t 1 -P 1024 -C 0,8 -M 1,0 -s 20 -zZq --thp 1 --no-data_rand_walk" Test not applicable, system has only 1 nodes. # Running 1x3-convergence, "perf bench numa mem -p 1 -t 3 -P 512 -s 100 -zZ0qcm --thp 1" 0.747 secs latency to NUMA-converge 0.747 secs slowest (max) thread-runtime 0.000 secs fastest (min) thread-runtime 0.714 secs average thread-runtime 50.000 % difference between max/avg runtime 3.228 GB data processed, per thread 9.683 GB data processed, total 0.231 nsecs/byte/thread runtime 4.321 GB/sec/thread speed 12.964 GB/sec total speed # Running 1x4-convergence, "perf bench numa mem -p 1 -t 4 -P 512 -s 100 -zZ0qcm --thp 1" 1.127 secs latency to NUMA-converge 1.127 secs slowest (max) thread-runtime 1.000 secs fastest (min) thread-runtime 1.089 secs average thread-runtime 5.624 % difference between max/avg runtime 3.765 GB data processed, per thread 15.062 GB data processed, total 0.299 nsecs/byte/thread runtime 3.342 GB/sec/thread speed 13.368 GB/sec total speed # Running 1x6-convergence, "perf bench numa mem -p 1 -t 6 -P 1020 -s 100 -zZ0qcm --thp 1" 1.003 secs latency to NUMA-converge 1.003 secs slowest (max) thread-runtime 0.000 secs fastest (min) thread-runtime 0.889 secs average thread-runtime 50.000 % difference between max/avg runtime 2.141 GB data processed, per thread 12.847 GB data processed, total 0.469 
nsecs/byte/thread runtime 2.134 GB/sec/thread speed 12.805 GB/sec total speed # Running 2x3-convergence, "perf bench numa mem -p 2 -t 3 -P 1020 -s 100 -zZ0qcm --thp 1" 1.814 secs latency to NUMA-converge 1.814 secs slowest (max) thread-runtime 1.000 secs fastest (min) thread-runtime 1.716 secs average thread-runtime 22.440 % difference between max/avg runtime 3.747 GB data processed, per thread 22.483 GB data processed, total 0.484 nsecs/byte/thread runtime 2.065 GB/sec/thread speed 12.393 GB/sec total speed # Running 3x3-convergence, "perf bench numa mem -p 3 -t 3 -P 1020 -s 100 -zZ0qcm --thp 1" 2.065 secs latency to NUMA-converge 2.065 secs slowest (max) thread-runtime 1.000 secs fastest (min) thread-runtime 1.947 secs average thread-runtime 25.788 % difference between max/avg runtime 2.855 GB data processed, per thread 25.694 GB data processed, total 0.723 nsecs/byte/thread runtime 1.382 GB/sec/thread speed 12.442 GB/sec total speed # Running 4x4-convergence, "perf bench numa mem -p 4 -t 4 -P 512 -s 100 -zZ0qcm --thp 1" 1.912 secs latency to NUMA-converge 1.912 secs slowest (max) thread-runtime 1.000 secs fastest (min) thread-runtime 1.775 secs average thread-runtime 23.852 % difference between max/avg runtime 1.479 GB data processed, per thread 23.668 GB data processed, total 1.293 nsecs/byte/thread runtime 0.774 GB/sec/thread speed 12.378 GB/sec total speed # Running 4x4-convergence-NOTHP, "perf bench numa mem -p 4 -t 4 -P 512 -s 100 -zZ0qcm --thp 1 --thp -1" 1.783 secs latency to NUMA-converge 1.783 secs slowest (max) thread-runtime 1.000 secs fastest (min) thread-runtime 1.633 secs average thread-runtime 21.960 % difference between max/avg runtime 1.345 GB data processed, per thread 21.517 GB data processed, total 1.326 nsecs/byte/thread runtime 0.754 GB/sec/thread speed 12.067 GB/sec total speed # Running 4x6-convergence, "perf bench numa mem -p 4 -t 6 -P 1020 -s 100 -zZ0qcm --thp 1" 5.396 secs latency to NUMA-converge 5.396 secs slowest (max) 
thread-runtime 4.000 secs fastest (min) thread-runtime 4.928 secs average thread-runtime 12.937 % difference between max/avg runtime 2.721 GB data processed, per thread 65.306 GB data processed, total 1.983 nsecs/byte/thread runtime 0.504 GB/sec/thread speed 12.102 GB/sec total speed # Running 4x8-convergence, "perf bench numa mem -p 4 -t 8 -P 512 -s 100 -zZ0qcm --thp 1" 3.121 secs latency to NUMA-converge 3.121 secs slowest (max) thread-runtime 2.000 secs fastest (min) thread-runtime 2.836 secs average thread-runtime 17.962 % difference between max/avg runtime 1.194 GB data processed, per thread 38.192 GB data processed, total 2.615 nsecs/byte/thread runtime 0.382 GB/sec/thread speed 12.236 GB/sec total speed # Running 8x4-convergence, "perf bench numa mem -p 8 -t 4 -P 512 -s 100 -zZ0qcm --thp 1" 4.302 secs latency to NUMA-converge 4.302 secs slowest (max) thread-runtime 3.000 secs fastest (min) thread-runtime 4.045 secs average thread-runtime 15.133 % difference between max/avg runtime 1.631 GB data processed, per thread 52.178 GB data processed, total 2.638 nsecs/byte/thread runtime 0.379 GB/sec/thread speed 12.128 GB/sec total speed # Running 8x4-convergence-NOTHP, "perf bench numa mem -p 8 -t 4 -P 512 -s 100 -zZ0qcm --thp 1 --thp -1" 4.418 secs latency to NUMA-converge 4.418 secs slowest (max) thread-runtime 3.000 secs fastest (min) thread-runtime 4.104 secs average thread-runtime 16.045 % difference between max/avg runtime 1.664 GB data processed, per thread 53.254 GB data processed, total 2.655 nsecs/byte/thread runtime 0.377 GB/sec/thread speed 12.055 GB/sec total speed # Running 3x1-convergence, "perf bench numa mem -p 3 -t 1 -P 512 -s 100 -zZ0qcm --thp 1" 0.973 secs latency to NUMA-converge 0.973 secs slowest (max) thread-runtime 0.000 secs fastest (min) thread-runtime 0.955 secs average thread-runtime 50.000 % difference between max/avg runtime 4.124 GB data processed, per thread 12.372 GB data processed, total 0.236 nsecs/byte/thread runtime 4.238 
GB/sec/thread speed 12.715 GB/sec total speed # Running 4x1-convergence, "perf bench numa mem -p 4 -t 1 -P 512 -s 100 -zZ0qcm --thp 1" 0.820 secs latency to NUMA-converge 0.820 secs slowest (max) thread-runtime 0.000 secs fastest (min) thread-runtime 0.808 secs average thread-runtime 50.000 % difference between max/avg runtime 2.555 GB data processed, per thread 10.220 GB data processed, total 0.321 nsecs/byte/thread runtime 3.117 GB/sec/thread speed 12.468 GB/sec total speed # Running 8x1-convergence, "perf bench numa mem -p 8 -t 1 -P 512 -s 100 -zZ0qcm --thp 1" 0.667 secs latency to NUMA-converge 0.667 secs slowest (max) thread-runtime 0.000 secs fastest (min) thread-runtime 0.607 secs average thread-runtime 50.000 % difference between max/avg runtime 1.009 GB data processed, per thread 8.069 GB data processed, total 0.661 nsecs/byte/thread runtime 1.512 GB/sec/thread speed 12.095 GB/sec total speed # Running 16x1-convergence, "perf bench numa mem -p 16 -t 1 -P 256 -s 100 -zZ0qcm --thp 1" 1.546 secs latency to NUMA-converge 1.546 secs slowest (max) thread-runtime 1.000 secs fastest (min) thread-runtime 1.485 secs average thread-runtime 17.664 % difference between max/avg runtime 1.162 GB data processed, per thread 18.594 GB data processed, total 1.331 nsecs/byte/thread runtime 0.752 GB/sec/thread speed 12.025 GB/sec total speed # Running 32x1-convergence, "perf bench numa mem -p 32 -t 1 -P 128 -s 100 -zZ0qcm --thp 1" 0.812 secs latency to NUMA-converge 0.812 secs slowest (max) thread-runtime 0.000 secs fastest (min) thread-runtime 0.739 secs average thread-runtime 50.000 % difference between max/avg runtime 0.309 GB data processed, per thread 9.874 GB data processed, total 2.630 nsecs/byte/thread runtime 0.380 GB/sec/thread speed 12.166 GB/sec total speed # Running 2x1-bw-process, "perf bench numa mem -p 2 -t 1 -P 1024 -s 20 -zZ0q --thp 1" 20.044 secs slowest (max) thread-runtime 20.000 secs fastest (min) thread-runtime 20.020 secs average thread-runtime 0.109 % 
difference between max/avg runtime 125.750 GB data processed, per thread 251.501 GB data processed, total 0.159 nsecs/byte/thread runtime 6.274 GB/sec/thread speed 12.548 GB/sec total speed # Running 3x1-bw-process, "perf bench numa mem -p 3 -t 1 -P 1024 -s 20 -zZ0q --thp 1" 20.148 secs slowest (max) thread-runtime 20.000 secs fastest (min) thread-runtime 20.090 secs average thread-runtime 0.367 % difference between max/avg runtime 85.267 GB data processed, per thread 255.800 GB data processed, total 0.236 nsecs/byte/thread runtime 4.232 GB/sec/thread speed 12.696 GB/sec total speed # Running 4x1-bw-process, "perf bench numa mem -p 4 -t 1 -P 1024 -s 20 -zZ0q --thp 1" 20.169 secs slowest (max) thread-runtime 20.000 secs fastest (min) thread-runtime 20.100 secs average thread-runtime 0.419 % difference between max/avg runtime 63.144 GB data processed, per thread 252.576 GB data processed, total 0.319 nsecs/byte/thread runtime 3.131 GB/sec/thread speed 12.523 GB/sec total speed # Running 8x1-bw-process, "perf bench numa mem -p 8 -t 1 -P 512 -s 20 -zZ0q --thp 1" 20.175 secs slowest (max) thread-runtime 20.000 secs fastest (min) thread-runtime 20.107 secs average thread-runtime 0.433 % difference between max/avg runtime 31.267 GB data processed, per thread 250.133 GB data processed, total 0.645 nsecs/byte/thread runtime 1.550 GB/sec/thread speed 12.398 GB/sec total speed # Running 8x1-bw-process-NOTHP, "perf bench numa mem -p 8 -t 1 -P 512 -s 20 -zZ0q --thp 1 --thp -1" 20.216 secs slowest (max) thread-runtime 20.000 secs fastest (min) thread-runtime 20.113 secs average thread-runtime 0.535 % difference between max/avg runtime 30.998 GB data processed, per thread 247.981 GB data processed, total 0.652 nsecs/byte/thread runtime 1.533 GB/sec/thread speed 12.266 GB/sec total speed # Running 16x1-bw-process, "perf bench numa mem -p 16 -t 1 -P 256 -s 20 -zZ0q --thp 1" 20.234 secs slowest (max) thread-runtime 20.000 secs fastest (min) thread-runtime 20.174 secs average 
thread-runtime 0.577 % difference between max/avg runtime 15.377 GB data processed, per thread 246.039 GB data processed, total 1.316 nsecs/byte/thread runtime 0.760 GB/sec/thread speed 12.160 GB/sec total speed # Running 1x4-bw-thread, "perf bench numa mem -p 1 -t 4 -T 256 -s 20 -zZ0q --thp 1" 20.040 secs slowest (max) thread-runtime 20.000 secs fastest (min) thread-runtime 20.028 secs average thread-runtime 0.099 % difference between max/avg runtime 66.832 GB data processed, per thread 267.328 GB data processed, total 0.300 nsecs/byte/thread runtime 3.335 GB/sec/thread speed 13.340 GB/sec total speed # Running 1x8-bw-thread, "perf bench numa mem -p 1 -t 8 -T 256 -s 20 -zZ0q --thp 1" 20.064 secs slowest (max) thread-runtime 20.000 secs fastest (min) thread-runtime 20.034 secs average thread-runtime 0.160 % difference between max/avg runtime 32.911 GB data processed, per thread 263.286 GB data processed, total 0.610 nsecs/byte/thread runtime 1.640 GB/sec/thread speed 13.122 GB/sec total speed # Running 1x16-bw-thread, "perf bench numa mem -p 1 -t 16 -T 128 -s 20 -zZ0q --thp 1" 20.092 secs slowest (max) thread-runtime 20.000 secs fastest (min) thread-runtime 20.052 secs average thread-runtime 0.230 % difference between max/avg runtime 16.131 GB data processed, per thread 258.088 GB data processed, total 1.246 nsecs/byte/thread runtime 0.803 GB/sec/thread speed 12.845 GB/sec total speed # Running 1x32-bw-thread, "perf bench numa mem -p 1 -t 32 -T 64 -s 20 -zZ0q --thp 1" 20.099 secs slowest (max) thread-runtime 20.000 secs fastest (min) thread-runtime 20.063 secs average thread-runtime 0.247 % difference between max/avg runtime 7.962 GB data processed, per thread 254.773 GB data processed, total 2.525 nsecs/byte/thread runtime 0.396 GB/sec/thread speed 12.676 GB/sec total speed # Running 2x3-bw-process, "perf bench numa mem -p 2 -t 3 -P 512 -s 20 -zZ0q --thp 1" 20.150 secs slowest (max) thread-runtime 20.000 secs fastest (min) thread-runtime 20.120 secs average 
thread-runtime 0.372 % difference between max/avg runtime 44.827 GB data processed, per thread 268.960 GB data processed, total 0.450 nsecs/byte/thread runtime 2.225 GB/sec/thread speed 13.348 GB/sec total speed # Running 4x4-bw-process, "perf bench numa mem -p 4 -t 4 -P 512 -s 20 -zZ0q --thp 1" 20.258 secs slowest (max) thread-runtime 20.000 secs fastest (min) thread-runtime 20.168 secs average thread-runtime 0.636 % difference between max/avg runtime 17.079 GB data processed, per thread 273.263 GB data processed, total 1.186 nsecs/byte/thread runtime 0.843 GB/sec/thread speed 13.489 GB/sec total speed # Running 4x6-bw-process, "perf bench numa mem -p 4 -t 6 -P 512 -s 20 -zZ0q --thp 1" 20.559 secs slowest (max) thread-runtime 20.000 secs fastest (min) thread-runtime 20.382 secs average thread-runtime 1.359 % difference between max/avg runtime 10.758 GB data processed, per thread 258.201 GB data processed, total 1.911 nsecs/byte/thread runtime 0.523 GB/sec/thread speed 12.559 GB/sec total speed # Running 4x8-bw-process, "perf bench numa mem -p 4 -t 8 -P 512 -s 20 -zZ0q --thp 1" 20.744 secs slowest (max) thread-runtime 20.000 secs fastest (min) thread-runtime 20.516 secs average thread-runtime 1.792 % difference between max/avg runtime 8.069 GB data processed, per thread 258.201 GB data processed, total 2.571 nsecs/byte/thread runtime 0.389 GB/sec/thread speed 12.447 GB/sec total speed # Running 4x8-bw-process-NOTHP, "perf bench numa mem -p 4 -t 8 -P 512 -s 20 -zZ0q --thp 1 --thp -1" 20.855 secs slowest (max) thread-runtime 20.000 secs fastest (min) thread-runtime 20.561 secs average thread-runtime 2.050 % difference between max/avg runtime 8.069 GB data processed, per thread 258.201 GB data processed, total 2.585 nsecs/byte/thread runtime 0.387 GB/sec/thread speed 12.381 GB/sec total speed # Running 3x3-bw-process, "perf bench numa mem -p 3 -t 3 -P 512 -s 20 -zZ0q --thp 1" 20.134 secs slowest (max) thread-runtime 20.000 secs fastest (min) thread-runtime 20.077 secs 
average thread-runtime 0.333 % difference between max/avg runtime 28.091 GB data processed, per thread 252.822 GB data processed, total 0.717 nsecs/byte/thread runtime 1.395 GB/sec/thread speed 12.557 GB/sec total speed # Running 5x5-bw-process, "perf bench numa mem -p 5 -t 5 -P 512 -s 20 -zZ0q --thp 1" 20.588 secs slowest (max) thread-runtime 20.000 secs fastest (min) thread-runtime 20.375 secs average thread-runtime 1.427 % difference between max/avg runtime 10.177 GB data processed, per thread 254.436 GB data processed, total 2.023 nsecs/byte/thread runtime 0.494 GB/sec/thread speed 12.359 GB/sec total speed # Running 2x16-bw-process, "perf bench numa mem -p 2 -t 16 -P 512 -s 20 -zZ0q --thp 1" 20.657 secs slowest (max) thread-runtime 20.000 secs fastest (min) thread-runtime 20.429 secs average thread-runtime 1.589 % difference between max/avg runtime 8.170 GB data processed, per thread 261.429 GB data processed, total 2.528 nsecs/byte/thread runtime 0.395 GB/sec/thread speed 12.656 GB/sec total speed # Running 1x32-bw-process, "perf bench numa mem -p 1 -t 32 -P 2048 -s 20 -zZ0q --thp 1" 22.981 secs slowest (max) thread-runtime 20.000 secs fastest (min) thread-runtime 21.996 secs average thread-runtime 6.486 % difference between max/avg runtime 8.863 GB data processed, per thread 283.606 GB data processed, total 2.593 nsecs/byte/thread runtime 0.386 GB/sec/thread speed 12.341 GB/sec total speed # Running numa02-bw, "perf bench numa mem -p 1 -t 32 -T 32 -s 20 -zZ0q --thp 1" 20.047 secs slowest (max) thread-runtime 19.000 secs fastest (min) thread-runtime 20.026 secs average thread-runtime 2.611 % difference between max/avg runtime 8.441 GB data processed, per thread 270.111 GB data processed, total 2.375 nsecs/byte/thread runtime 0.421 GB/sec/thread speed 13.474 GB/sec total speed # Running numa02-bw-NOTHP, "perf bench numa mem -p 1 -t 32 -T 32 -s 20 -zZ0q --thp 1 --thp -1" 20.088 secs slowest (max) thread-runtime 19.000 secs fastest (min) thread-runtime 20.025 
secs average thread-runtime 2.709 % difference between max/avg runtime 8.411 GB data processed, per thread 269.142 GB data processed, total 2.388 nsecs/byte/thread runtime 0.419 GB/sec/thread speed 13.398 GB/sec total speed # Running numa01-bw-thread, "perf bench numa mem -p 2 -t 16 -T 192 -s 20 -zZ0q --thp 1" 20.293 secs slowest (max) thread-runtime 20.000 secs fastest (min) thread-runtime 20.175 secs average thread-runtime 0.721 % difference between max/avg runtime 7.918 GB data processed, per thread 253.374 GB data processed, total 2.563 nsecs/byte/thread runtime 0.390 GB/sec/thread speed 12.486 GB/sec total speed # Running numa01-bw-thread-NOTHP, "perf bench numa mem -p 2 -t 16 -T 192 -s 20 -zZ0q --thp 1 --thp -1" 20.411 secs slowest (max) thread-runtime 20.000 secs fastest (min) thread-runtime 20.226 secs average thread-runtime 1.006 % difference between max/avg runtime 7.931 GB data processed, per thread 253.778 GB data processed, total 2.574 nsecs/byte/thread runtime 0.389 GB/sec/thread speed 12.434 GB/sec total speed # Signed-off-by: Ian Rogers <irogers@google.com> Acked-by: Jiri Olsa <jolsa@redhat.com> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Link: https://lore.kernel.org/r/20201012161611.366482-1-irogers@google.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
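The commit message describes replacing unbalanced mutexes (locked on one thread, unlocked on another) with condition variables. The sketch below shows the general pattern of a start gate built on a condition variable. It uses Python's `threading.Condition` as an analogue of the pthread primitives; the names `StartGate` and `run_workers` are hypothetical, and this is not the benchmark's actual code.

```python
import threading


class StartGate:
    """One-shot gate: blocks callers until nr_workers threads have arrived."""

    def __init__(self, nr_workers):
        self._cond = threading.Condition()
        self._nr_workers = nr_workers
        self._arrived = 0

    def wait(self):
        # The lock is always taken and released by the same thread, which is
        # the property the unbalanced-mutex scheme lacked.
        with self._cond:
            self._arrived += 1
            if self._arrived == self._nr_workers:
                self._cond.notify_all()
            # wait_for() re-checks the predicate, so spurious wakeups are harmless.
            self._cond.wait_for(lambda: self._arrived == self._nr_workers)


def run_workers(nr_workers):
    """Start nr_workers threads and record when each is ready and released."""
    gate = StartGate(nr_workers)
    events = []
    events_lock = threading.Lock()

    def worker(idx):
        with events_lock:
            events.append(("ready", idx))
        gate.wait()
        with events_lock:
            events.append(("go", idx))

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(nr_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return events


if __name__ == "__main__":
    for kind, idx in run_workers(4):
        print(kind, idx)
```

Because every worker records "ready" before entering the gate, all "ready" entries precede every "go" entry regardless of scheduling. The gate is deliberately one-shot; a reusable barrier would need a generation counter.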
2020-10-14 | perf tools: Pass build_id object to filename__read_build_id() | Jiri Olsa

Pass a build_id object to the filename__read_build_id() function, so that it can populate the size of the build_id object. This changes filename__read_build_id() for both the ELF and non-ELF code paths.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20201013192441.1299447-3-jolsa@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2020-10-14 | perf tools: Use build_id object in dso | Jiri Olsa

Replace the build_id byte array with a struct build_id object, and update all the code that references it. The objective is to carry the size together with the build id array, so it is better to keep both together.

This is a preparatory change for the following patches, and there is no functional change.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20201013192441.1299447-2-jolsa@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
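To illustrate why carrying the size next to the bytes matters, here is a small Python sketch of the idea, not the C definition from the patch. The class name `BuildId`, the constant `BUILD_ID_MAX`, and the 20-byte capacity are assumptions made for the example; the stand-in byte strings only mimic the lengths of an MD5-sized (16-byte) and a SHA-1-sized (20-byte) id.

```python
from dataclasses import dataclass

# Assumed buffer capacity: large enough for a SHA-1 sized id (20 bytes).
BUILD_ID_MAX = 20


@dataclass(frozen=True)
class BuildId:
    """Fixed-size buffer plus the number of bytes that are actually valid."""

    data: bytes
    size: int

    @classmethod
    def from_bytes(cls, raw):
        if len(raw) > BUILD_ID_MAX:
            raise ValueError("build id longer than the fixed buffer")
        # Pad to the fixed capacity, as a C array member would be.
        padded = raw + bytes(BUILD_ID_MAX - len(raw))
        return cls(data=padded, size=len(raw))

    def hex(self):
        # Only the first `size` bytes belong to the id; the rest is padding.
        return self.data[: self.size].hex()


if __name__ == "__main__":
    short_id = BuildId.from_bytes(bytes(range(1, 17)))  # 16 bytes, MD5-sized
    long_id = BuildId.from_bytes(bytes(range(1, 21)))   # 20 bytes, SHA-1-sized
    print(short_id.size, short_id.hex())
    print(long_id.size, long_id.hex())
    # Without the size, the padding would be printed as part of the id:
    print(short_id.data.hex())
```

With a bare byte array, a 16-byte id is indistinguishable from a 20-byte id that happens to end in zeros; storing the size removes that ambiguity.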
2020-10-13 | perf bench: Run inject-build-id with --buildid-all option too | Namhyung Kim

For comparison, it now runs the benchmark twice: once with the regular -b and once with --buildid-all.

  $ perf bench internals inject-build-id
  # Running 'internals/inject-build-id' benchmark:
    Average build-id injection took: 21.002 msec (+- 0.172 msec)
    Average time per event: 2.059 usec (+- 0.017 usec)
    Average memory usage: 8169 KB (+- 0 KB)
    Average build-id-all injection took: 19.543 msec (+- 0.124 msec)
    Average time per event: 1.916 usec (+- 0.012 usec)
    Average memory usage: 7348 KB (+- 0 KB)

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Acked-by: Ian Rogers <irogers@google.com>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Link: https://lore.kernel.org/r/20201012070214.2074921-7-namhyung@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
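The two result sets quoted in this commit can be compared directly. The short Python sketch below computes the relative saving of the --buildid-all run over the regular -b run, using only the numbers from the output in this commit message; the function name `relative_saving` is a hypothetical helper.

```python
def relative_saving(baseline, candidate):
    """Percentage by which `candidate` is smaller than `baseline`."""
    return (baseline - candidate) / baseline * 100.0


# (regular -b, --buildid-all) pairs copied from the benchmark output.
RESULTS = {
    "injection time (msec)": (21.002, 19.543),
    "time per event (usec)": (2.059, 1.916),
    "memory usage (KB)": (8169, 7348),
}


def summarize(results):
    """Map each metric name to its relative saving in percent."""
    return {name: relative_saving(base, cand) for name, (base, cand) in results.items()}


if __name__ == "__main__":
    for name, saving in summarize(RESULTS).items():
        print(f"{name}: {saving:.2f}% lower with --buildid-all")
```

On this one run, --buildid-all is roughly 7% faster and uses roughly 10% less memory; a single run on one machine is not enough to generalize from.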
2020-10-13 | perf bench: Add build-id injection benchmark | Namhyung Kim

Sometimes I can see that 'perf record' piped with 'perf inject' takes a long time processing build-ids. So introduce an inject-build-id benchmark to the internals benchmark suite to measure its overhead regularly.

It runs the 'perf inject' command internally and feeds the given number of synthesized events (MMAP2 + SAMPLE basically).

  Usage: perf bench internals inject-build-id <options>

    -i, --iterations <n>  Number of iterations used to compute average (default: 100)
    -m, --nr-mmaps <n>    Number of mmap events for each iteration (default: 100)
    -n, --nr-samples <n>  Number of sample events per mmap event (default: 100)
    -v, --verbose         be more verbose (show iteration count, DSO name, etc)

By default, it measures average processing time of 100 MMAP2 events and 10000 SAMPLE events. Below is a result on my laptop.

  $ perf bench internals inject-build-id
  # Running 'internals/inject-build-id' benchmark:
    Average build-id injection took: 25.789 msec (+- 0.202 msec)
    Average time per event: 2.528 usec (+- 0.020 usec)
    Average memory usage: 8411 KB (+- 7 KB)

Committer testing:

  $ perf bench
  Usage:
        perf bench [<common options>] <collection> <benchmark> [<options>]

          # List of all available benchmark collections:

           sched: Scheduler and IPC benchmarks
         syscall: System call benchmarks
             mem: Memory access benchmarks
            numa: NUMA scheduling and MM benchmarks
           futex: Futex stressing benchmarks
           epoll: Epoll stressing benchmarks
       internals: Perf-internals benchmarks
             all: All benchmarks

  $ perf bench internals

          # List of available benchmarks for collection 'internals':

        synthesize: Benchmark perf event synthesis
    kallsyms-parse: Benchmark kallsyms parsing
   inject-build-id: Benchmark build-id injection

  $ perf bench internals inject-build-id
  # Running 'internals/inject-build-id' benchmark:
    Average build-id injection took: 14.202 msec (+- 0.059 msec)
    Average time per event: 1.392 usec (+- 0.006 usec)
    Average memory usage: 12650 KB (+- 10 KB)
    Average build-id-all injection took: 12.831 msec (+- 0.071 msec)
    Average time per event: 1.258 usec (+- 0.007 usec)
    Average memory usage: 11895 KB (+- 10 KB)
  $

  $ perf stat -r5 perf bench internals inject-build-id
  # Running 'internals/inject-build-id' benchmark:
    Average build-id injection took: 14.380 msec (+- 0.056 msec)
    Average time per event: 1.410 usec (+- 0.006 usec)
    Average memory usage: 12608 KB (+- 11 KB)
    Average build-id-all injection took: 11.889 msec (+- 0.064 msec)
    Average time per event: 1.166 usec (+- 0.006 usec)
    Average memory usage: 11838 KB (+- 10 KB)
  # Running 'internals/inject-build-id' benchmark:
    Average build-id injection took: 14.246 msec (+- 0.065 msec)
    Average time per event: 1.397 usec (+- 0.006 usec)
    Average memory usage: 12744 KB (+- 10 KB)
    Average build-id-all injection took: 12.019 msec (+- 0.066 msec)
    Average time per event: 1.178 usec (+- 0.006 usec)
    Average memory usage: 11963 KB (+- 10 KB)
  # Running 'internals/inject-build-id' benchmark:
    Average build-id injection took: 14.321 msec (+- 0.067 msec)
    Average time per event: 1.404 usec (+- 0.007 usec)
    Average memory usage: 12690 KB (+- 10 KB)
    Average build-id-all injection took: 11.909 msec (+- 0.041 msec)
    Average time per event: 1.168 usec (+- 0.004 usec)
    Average memory usage: 11938 KB (+- 10 KB)
  # Running 'internals/inject-build-id' benchmark:
    Average build-id injection took: 14.287 msec (+- 0.059 msec)
    Average time per event: 1.401 usec (+- 0.006 usec)
    Average memory usage: 12864 KB (+- 10 KB)
    Average build-id-all injection took: 11.862 msec (+- 0.058 msec)
    Average time per event: 1.163 usec (+- 0.006 usec)
    Average memory usage: 12103 KB (+- 10 KB)
  # Running 'internals/inject-build-id' benchmark:
    Average build-id injection took: 14.402 msec (+- 0.053 msec)
    Average time per event: 1.412 usec (+- 0.005 usec)
    Average memory usage: 12876 KB (+- 10 KB)
    Average build-id-all injection took: 11.826 msec (+- 0.061 msec)
    Average time per event: 1.159 usec (+- 0.006 usec)
    Average memory usage: 12111 KB (+- 10 KB)

   Performance counter stats for 'perf bench internals inject-build-id' (5 runs):

          4,267.48 msec task-clock:u               #    1.502 CPUs utilized            ( +-  0.14% )
                 0      context-switches:u         #    0.000 K/sec
                 0      cpu-migrations:u           #    0.000 K/sec
           102,092      page-faults:u              #    0.024 M/sec                    ( +-  0.08% )
     3,894,589,578      cycles:u                   #    0.913 GHz                      ( +-  0.19% )  (83.49%)
       140,078,421      stalled-cycles-frontend:u  #    3.60% frontend cycles idle     ( +-  0.77% )  (83.34%)
       948,581,189      stalled-cycles-backend:u   #   24.36% backend cycles idle      ( +-  0.46% )  (83.25%)
     5,835,587,719      instructions:u             #    1.50  insn per cycle
                                                   #    0.16  stalled cycles per insn  ( +-  0.21% )  (83.24%)
     1,267,423,636      branches:u                 #  296.996 M/sec                    ( +-  0.22% )  (83.12%)
        17,484,290      branch-misses:u            #    1.38% of all branches          ( +-  0.12% )  (83.55%)

           2.84176 +- 0.00222 seconds time elapsed  ( +-  0.08% )

  $

Acked-by: Jiri Olsa <jolsa@redhat.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/r/20201012070214.2074921-2-namhyung@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
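The derived columns in the 'perf stat -r5' summary of this commit can be recomputed from the raw counters. The Python sketch below does so with the values copied from that output. The formulas are the ones the printed figures are consistent with (for example, stalled cycles per instruction taken from the larger of the two stall counters); treat them as an assumption about how the tool derives them, and note that the dictionary keys are names chosen for this example.

```python
# Raw values copied from the 'perf stat -r5' summary in the commit message.
COUNTERS = {
    "task_clock_msec": 4267.48,
    "elapsed_sec": 2.84176,
    "cycles": 3_894_589_578,
    "stalled_frontend": 140_078_421,
    "stalled_backend": 948_581_189,
    "instructions": 5_835_587_719,
    "branches": 1_267_423_636,
    "branch_misses": 17_484_290,
}


def derive(c):
    """Recompute the right-hand-side metrics from the raw counters."""
    task_clock_sec = c["task_clock_msec"] / 1000.0
    worst_stall = max(c["stalled_frontend"], c["stalled_backend"])
    return {
        "cpus_utilized": task_clock_sec / c["elapsed_sec"],
        "ghz": c["cycles"] / task_clock_sec / 1e9,
        "insn_per_cycle": c["instructions"] / c["cycles"],
        "frontend_idle_pct": 100.0 * c["stalled_frontend"] / c["cycles"],
        "backend_idle_pct": 100.0 * c["stalled_backend"] / c["cycles"],
        "stalled_per_insn": worst_stall / c["instructions"],
        "branch_miss_pct": 100.0 * c["branch_misses"] / c["branches"],
    }


if __name__ == "__main__":
    for name, value in derive(COUNTERS).items():
        print(f"{name}: {value:.3f}")
```

A "CPUs utilized" value above 1 means the measured process tree kept more than one CPU busy on average, which fits a benchmark that runs 'perf inject' as a child process.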
2020-10-12Merge tag 'ras_updates_for_v5.10' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull RAS updates from Borislav Petkov:
 - Extend the recovery from MCE in kernel space also to processes which encounter an MCE in kernel space but while copying from user memory by sending them a SIGBUS on return to user space and unmapping the faulty memory, by Tony Luck and Youquan Song.
 - memcpy_mcsafe() rework by splitting the functionality into copy_mc_to_user() and copy_mc_to_kernel(). This, as a result, enables support for new hardware which can recover from a machine check encountered during a fast string copy and makes that the default and lets the older hardware which does not support that advanced recovery opt in to use the old, fragile, slow variant, by Dan Williams.
 - New AMD hw enablement, by Yazen Ghannam and Akshay Gupta.
 - Do not use MSR-tracing accessors in #MC context and flag any fault while accessing MCA architectural MSRs as an architectural violation with the hope that such hw/fw misdesigns are caught early during the hw eval phase and they don't make it into production.
 - Misc fixes, improvements and cleanups, as always.
* tag 'ras_updates_for_v5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/mce: Allow for copy_mc_fragile symbol checksum to be generated x86/mce: Decode a kernel instruction to determine if it is copying from user x86/mce: Recover from poison found while copying from user space x86/mce: Avoid tail copy when machine check terminated a copy from user x86/mce: Add _ASM_EXTABLE_CPY for copy user access x86/mce: Provide method to find out the type of an exception handler x86/mce: Pass pointer to saved pt_regs to severity calculation routines x86/copy_mc: Introduce copy_mc_enhanced_fast_string() x86, powerpc: Rename memcpy_mcsafe() to copy_mc_to_{user, kernel}() x86/mce: Drop AMD-specific "DEFERRED" case from Intel severity rule list x86/mce: Add Skylake quirk for patrol scrub reported errors RAS/CEC: Convert to DEFINE_SHOW_ATTRIBUTE() x86/mce: Annotate mce_rd/wrmsrl() with noinstr x86/mce/dev-mcelog: Do not update kflags on AMD systems x86/mce: Stop mce_reign() from re-computing severity for every CPU x86/mce: Make mce_rdmsrl() panic on an inaccessible MSR x86/mce: Increase maximum number of banks to 64 x86/mce: Delay clearing IA32_MCG_STATUS to the end of do_machine_check() x86/MCE/AMD, EDAC/mce_amd: Remove struct smca_hwid.xec_bitmap RAS/CEC: Fix cec_init() prototype
2020-10-06x86, powerpc: Rename memcpy_mcsafe() to copy_mc_to_{user, kernel}()Dan Williams
In reaction to a proposal to introduce a memcpy_mcsafe_fast() implementation Linus points out that memcpy_mcsafe() is poorly named relative to communicating the scope of the interface. Specifically what addresses are valid to pass as source, destination, and what faults / exceptions are handled. Of particular concern is that even though x86 might be able to handle the semantics of copy_mc_to_user() with its common copy_user_generic() implementation other archs likely need / want an explicit path for this case: On Fri, May 1, 2020 at 11:28 AM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Thu, Apr 30, 2020 at 6:21 PM Dan Williams <dan.j.williams@intel.com> wrote: > > > > However now I see that copy_user_generic() works for the wrong reason. > > It works because the exception on the source address due to poison > > looks no different than a write fault on the user address to the > > caller, it's still just a short copy. So it makes copy_to_user() work > > for the wrong reason relative to the name. > > Right. > > And it won't work that way on other architectures. On x86, we have a > generic function that can take faults on either side, and we use it > for both cases (and for the "in_user" case too), but that's an > artifact of the architecture oddity. > > In fact, it's probably wrong even on x86 - because it can hide bugs - > but writing those things is painful enough that everybody prefers > having just one function. Replace a single top-level memcpy_mcsafe() with either copy_mc_to_user(), or copy_mc_to_kernel(). Introduce an x86 copy_mc_fragile() name as the rename for the low-level x86 implementation formerly named memcpy_mcsafe(). It is used as the slow / careful backend that is supplanted by a fast copy_mc_generic() in a follow-on patch. One side-effect of this reorganization is that separating copy_mc_64.S to its own file means that perf no longer needs to track dependencies for its memcpy_64.S benchmarks. [ bp: Massage a bit. 
] Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Tony Luck <tony.luck@intel.com> Acked-by: Michael Ellerman <mpe@ellerman.id.au> Cc: <stable@vger.kernel.org> Link: http://lore.kernel.org/r/CAHk-=wjSqtXAqfUJxFtWNwmguFASTgB0dz1dT3V-78Quiezqbg@mail.gmail.com Link: https://lkml.kernel.org/r/160195561680.2163339.11574962055305783722.stgit@dwillia2-desk3.amr.corp.intel.com
2020-09-14perf bench: Fix 2 memory sanitizer warningsIan Rogers
Memory sanitizer warns if a write is performed where the memory being read for the write is uninitialized. Avoid this warning by initializing the memory. Signed-off-by: Ian Rogers <irogers@google.com> Acked-by: Jiri Olsa <jolsa@redhat.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Link: http://lore.kernel.org/lkml/20200912053725.1405857-1-irogers@google.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2020-09-03perf bench: The do_run_multi_threaded() function must use ↵YueHaibing
IS_ERR(perf_session__new()) In case of error, the function perf_session__new() returns ERR_PTR() and never returns NULL. The NULL test in the return value check should be replaced with IS_ERR(). Committer notes: This wasn't compiling due to an extraneous '{' not matched by a '}'; fix it. Fixes: 13edc237200c ("perf bench: Add a multi-threaded synthesize benchmark") Signed-off-by: YueHaibing <yuehaibing@huawei.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Ian Rogers <irogers@google.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lore.kernel.org/lkml/20200902140526.26916-1-yuehaibing@huawei.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2020-08-14perf bench numa: Remove dead code in parse_nodes_opt()Peng Fan
In the function parse_nodes_opt(), the statement "return 0;" is dead code; remove it. Signed-off-by: Peng Fan <fanpeng@loongson.cn> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lore.kernel.org/lkml/1597401894-27549-1-git-send-email-fanpeng@loongson.cn Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2020-08-13perf bench numa: Use numa_node_to_cpus() to bind tasks to nodesAlexander Gordeev
It is currently assumed that each node contains at most nr_cpus/nr_nodes CPUs and nodes' CPU ranges do not overlap. That assumption is generally incorrect as there are archs where a CPU number does not depend on its node number. This update removes the described assumption by simply calling the numa_node_to_cpus() interface and using the returned mask for binding CPUs to nodes. Also, variable types and names are made consistent in functions using cpumask. Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Balamuruhan S <bala24@linux.vnet.ibm.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Satheesh Rajendran <sathnaga@linux.vnet.ibm.com> Link: http://lore.kernel.org/lkml/20200813113247.GA2014@oc3871087118.ibm.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2020-08-13perf bench numa: Fix cpumask memory leak in node_has_cpus()Alexander Gordeev
Pair numa_allocate_cpumask() with numa_free_cpumask() so that the allocated mask is freed. Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Balamuruhan S <bala24@linux.vnet.ibm.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Satheesh Rajendran <sathnaga@linux.vnet.ibm.com> Link: http://lore.kernel.org/lkml/20200813113041.GA1685@oc3871087118.ibm.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2020-08-13perf bench mem: Always memset source before memcpyVincent Whitchurch
For memcpy, the source pages are memset to zero only when --cycles is used. This leads to wildly different results with or without --cycles, since all source pages are likely to be mapped to the same zero page without explicit writes.

Before this fix:

  $ export cmd="./perf stat -e LLC-loads -- ./perf bench \
    mem memcpy -s 1024MB -l 100 -f default"
  $ $cmd
      2,935,826 LLC-loads
      3.821677452 seconds time elapsed
  $ $cmd --cycles
      217,533,436 LLC-loads
      8.616725985 seconds time elapsed

After this fix:

  $ $cmd
      214,459,686 LLC-loads
      8.674301124 seconds time elapsed
  $ $cmd --cycles
      214,758,651 LLC-loads
      8.644480006 seconds time elapsed

Fixes: 47b5757bac03c338 ("perf bench mem: Move boilerplate memory allocation to the infrastructure")
Signed-off-by: Vincent Whitchurch <vincent.whitchurch@axis.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: kernel@axis.com
Link: http://lore.kernel.org/lkml/20200810133404.30829-1-vincent.whitchurch@axis.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2020-08-12perf bench: Fix a couple of spelling mistakes in options textColin Ian King
There are a couple of spelling mistakes in the text. Fix these. Signed-off-by: Colin King <colin.king@canonical.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: kernel-janitors@vger.kernel.org Link: http://lore.kernel.org/lkml/20200812064647.200132-1-colin.king@canonical.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2020-08-12perf bench numa: Fix benchmark namesAlexander Gordeev
Standard benchmark names let users know the test specifics. For example, the "2x1-bw-process" name tells that two processes with one thread each are run and the RAM bandwidth is measured. Several benchmark names do not correspond to their actual running configuration. Fix that and also some whitespace and comment inconsistencies. Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lore.kernel.org/lkml/6b6f2084f132ee8e9203dc7c32f9deb209b87a68.1597004831.git.agordeev@linux.ibm.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2020-08-12perf bench numa: Fix number of processes in "2x3-convergence" testAlexander Gordeev
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> Acked-by: Namhyung Kim <namhyung@kernel.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lore.kernel.org/lkml/d949f5f48e17fc816f3beecf8479f1b2480345e4.1597004831.git.agordeev@linux.ibm.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2020-07-31perf bench: Add benchmark of find_next_bitIan Rogers
for_each_set_bit, or similar functions like for_each_cpu, may be hot within the kernel. If many bits are set, one could imagine that on Intel testing every bit with a "bt" instruction may be faster than the function call and word-length find_next_bit logic. Add a benchmark to measure this. This benchmark on AMD rome and Intel skylakex shows "bt" is not a good option except for very small bitmaps. Committer testing: # perf bench Usage: perf bench [<common options>] <collection> <benchmark> [<options>] # List of all available benchmark collections: sched: Scheduler and IPC benchmarks syscall: System call benchmarks mem: Memory access benchmarks numa: NUMA scheduling and MM benchmarks futex: Futex stressing benchmarks epoll: Epoll stressing benchmarks internals: Perf-internals benchmarks all: All benchmarks # perf bench mem # List of available benchmarks for collection 'mem': memcpy: Benchmark for memcpy() functions memset: Benchmark for memset() functions find_bit: Benchmark for find_bit() functions all: Run all memory access benchmarks # perf bench mem find_bit # Running 'mem/find_bit' benchmark: 100000 operations 1 bits set of 1 bits Average for_each_set_bit took: 730.200 usec (+- 6.468 usec) Average test_bit loop took: 366.200 usec (+- 4.652 usec) 100000 operations 1 bits set of 2 bits Average for_each_set_bit took: 781.000 usec (+- 24.247 usec) Average test_bit loop took: 550.200 usec (+- 4.152 usec) 100000 operations 2 bits set of 2 bits Average for_each_set_bit took: 1113.400 usec (+- 112.340 usec) Average test_bit loop took: 1098.500 usec (+- 182.834 usec) 100000 operations 1 bits set of 4 bits Average for_each_set_bit took: 843.800 usec (+- 8.772 usec) Average test_bit loop took: 948.800 usec (+- 10.278 usec) 100000 operations 2 bits set of 4 bits Average for_each_set_bit took: 1185.800 usec (+- 114.345 usec) Average test_bit loop took: 1473.200 usec (+- 175.498 usec) 100000 operations 4 bits set of 4 bits Average for_each_set_bit took: 1769.667 usec (+- 233.177 
usec) Average test_bit loop took: 1864.933 usec (+- 187.470 usec) 100000 operations 1 bits set of 8 bits Average for_each_set_bit took: 898.000 usec (+- 21.755 usec) Average test_bit loop took: 1768.400 usec (+- 23.672 usec) 100000 operations 2 bits set of 8 bits Average for_each_set_bit took: 1244.900 usec (+- 116.396 usec) Average test_bit loop took: 2201.800 usec (+- 145.398 usec) 100000 operations 4 bits set of 8 bits Average for_each_set_bit took: 1822.533 usec (+- 231.554 usec) Average test_bit loop took: 2569.467 usec (+- 168.453 usec) 100000 operations 8 bits set of 8 bits Average for_each_set_bit took: 2845.100 usec (+- 441.365 usec) Average test_bit loop took: 3023.300 usec (+- 219.575 usec) 100000 operations 1 bits set of 16 bits Average for_each_set_bit took: 923.400 usec (+- 17.560 usec) Average test_bit loop took: 3240.000 usec (+- 16.492 usec) 100000 operations 2 bits set of 16 bits Average for_each_set_bit took: 1264.300 usec (+- 114.034 usec) Average test_bit loop took: 3714.400 usec (+- 158.898 usec) 100000 operations 4 bits set of 16 bits Average for_each_set_bit took: 1817.867 usec (+- 222.199 usec) Average test_bit loop took: 4015.333 usec (+- 154.162 usec) 100000 operations 8 bits set of 16 bits Average for_each_set_bit took: 2826.350 usec (+- 433.457 usec) Average test_bit loop took: 4460.350 usec (+- 210.762 usec) 100000 operations 16 bits set of 16 bits Average for_each_set_bit took: 4615.600 usec (+- 809.350 usec) Average test_bit loop took: 5129.960 usec (+- 320.821 usec) 100000 operations 1 bits set of 32 bits Average for_each_set_bit took: 904.400 usec (+- 14.250 usec) Average test_bit loop took: 6194.000 usec (+- 29.254 usec) 100000 operations 2 bits set of 32 bits Average for_each_set_bit took: 1252.700 usec (+- 116.432 usec) Average test_bit loop took: 6652.400 usec (+- 154.352 usec) 100000 operations 4 bits set of 32 bits Average for_each_set_bit took: 1824.200 usec (+- 229.133 usec) Average test_bit loop took: 6961.733 usec (+- 
154.682 usec) 100000 operations 8 bits set of 32 bits Average for_each_set_bit took: 2823.950 usec (+- 432.296 usec) Average test_bit loop took: 7351.900 usec (+- 193.626 usec) 100000 operations 16 bits set of 32 bits Average for_each_set_bit took: 4552.560 usec (+- 785.141 usec) Average test_bit loop took: 7998.360 usec (+- 305.629 usec) 100000 operations 32 bits set of 32 bits Average for_each_set_bit took: 7557.067 usec (+- 1407.702 usec) Average test_bit loop took: 9072.400 usec (+- 513.209 usec) 100000 operations 1 bits set of 64 bits Average for_each_set_bit took: 896.800 usec (+- 14.389 usec) Average test_bit loop took: 11927.200 usec (+- 68.862 usec) 100000 operations 2 bits set of 64 bits Average for_each_set_bit took: 1230.400 usec (+- 111.731 usec) Average test_bit loop took: 12478.600 usec (+- 189.382 usec) 100000 operations 4 bits set of 64 bits Average for_each_set_bit took: 1844.733 usec (+- 244.826 usec) Average test_bit loop took: 12911.467 usec (+- 206.246 usec) 100000 operations 8 bits set of 64 bits Average for_each_set_bit took: 2779.300 usec (+- 413.612 usec) Average test_bit loop took: 13372.650 usec (+- 239.623 usec) 100000 operations 16 bits set of 64 bits Average for_each_set_bit took: 4423.920 usec (+- 748.240 usec) Average test_bit loop took: 13995.800 usec (+- 318.427 usec) 100000 operations 32 bits set of 64 bits Average for_each_set_bit took: 7580.600 usec (+- 1462.407 usec) Average test_bit loop took: 15063.067 usec (+- 516.477 usec) 100000 operations 64 bits set of 64 bits Average for_each_set_bit took: 13391.514 usec (+- 2765.371 usec) Average test_bit loop took: 16974.914 usec (+- 916.936 usec) 100000 operations 1 bits set of 128 bits Average for_each_set_bit took: 1153.800 usec (+- 124.245 usec) Average test_bit loop took: 26959.000 usec (+- 714.047 usec) 100000 operations 2 bits set of 128 bits Average for_each_set_bit took: 1445.200 usec (+- 113.587 usec) Average test_bit loop took: 25798.800 usec (+- 512.908 usec) 100000 
operations 4 bits set of 128 bits Average for_each_set_bit took: 1990.933 usec (+- 219.362 usec) Average test_bit loop took: 25589.400 usec (+- 348.288 usec) 100000 operations 8 bits set of 128 bits Average for_each_set_bit took: 2963.000 usec (+- 419.487 usec) Average test_bit loop took: 25690.050 usec (+- 262.025 usec) 100000 operations 16 bits set of 128 bits Average for_each_set_bit took: 4585.200 usec (+- 741.734 usec) Average test_bit loop took: 26125.040 usec (+- 274.127 usec) 100000 operations 32 bits set of 128 bits Average for_each_set_bit took: 7626.200 usec (+- 1404.950 usec) Average test_bit loop took: 27038.867 usec (+- 442.554 usec) 100000 operations 64 bits set of 128 bits Average for_each_set_bit took: 13343.371 usec (+- 2686.460 usec) Average test_bit loop took: 28936.543 usec (+- 883.257 usec) 100000 operations 128 bits set of 128 bits Average for_each_set_bit took: 23442.950 usec (+- 4880.541 usec) Average test_bit loop took: 32484.125 usec (+- 1691.931 usec) 100000 operations 1 bits set of 256 bits Average for_each_set_bit took: 1183.000 usec (+- 32.073 usec) Average test_bit loop took: 50114.600 usec (+- 198.880 usec) 100000 operations 2 bits set of 256 bits Average for_each_set_bit took: 1550.000 usec (+- 124.550 usec) Average test_bit loop took: 50334.200 usec (+- 128.425 usec) 100000 operations 4 bits set of 256 bits Average for_each_set_bit took: 2164.333 usec (+- 246.359 usec) Average test_bit loop took: 49959.867 usec (+- 188.035 usec) 100000 operations 8 bits set of 256 bits Average for_each_set_bit took: 3211.200 usec (+- 454.829 usec) Average test_bit loop took: 50140.850 usec (+- 176.046 usec) 100000 operations 16 bits set of 256 bits Average for_each_set_bit took: 5181.640 usec (+- 882.726 usec) Average test_bit loop took: 51003.160 usec (+- 419.601 usec) 100000 operations 32 bits set of 256 bits Average for_each_set_bit took: 8369.333 usec (+- 1513.150 usec) Average test_bit loop took: 52096.700 usec (+- 573.022 usec) 100000 
operations 64 bits set of 256 bits Average for_each_set_bit took: 13866.857 usec (+- 2649.393 usec) Average test_bit loop took: 53989.600 usec (+- 938.808 usec) 100000 operations 128 bits set of 256 bits Average for_each_set_bit took: 23588.350 usec (+- 4724.222 usec) Average test_bit loop took: 57300.625 usec (+- 1625.962 usec) 100000 operations 256 bits set of 256 bits Average for_each_set_bit took: 42752.200 usec (+- 9202.084 usec) Average test_bit loop took: 64426.933 usec (+- 3402.326 usec) 100000 operations 1 bits set of 512 bits Average for_each_set_bit took: 1632.000 usec (+- 229.954 usec) Average test_bit loop took: 98090.000 usec (+- 1120.435 usec) 100000 operations 2 bits set of 512 bits Average for_each_set_bit took: 1937.700 usec (+- 148.902 usec) Average test_bit loop took: 100364.100 usec (+- 1433.219 usec) 100000 operations 4 bits set of 512 bits Average for_each_set_bit took: 2528.000 usec (+- 243.654 usec) Average test_bit loop took: 99932.067 usec (+- 955.868 usec) 100000 operations 8 bits set of 512 bits Average for_each_set_bit took: 3734.100 usec (+- 512.359 usec) Average test_bit loop took: 98944.750 usec (+- 812.070 usec) 100000 operations 16 bits set of 512 bits Average for_each_set_bit took: 5551.400 usec (+- 846.605 usec) Average test_bit loop took: 98691.600 usec (+- 654.753 usec) 100000 operations 32 bits set of 512 bits Average for_each_set_bit took: 8594.500 usec (+- 1446.072 usec) Average test_bit loop took: 99176.867 usec (+- 579.990 usec) 100000 operations 64 bits set of 512 bits Average for_each_set_bit took: 13840.743 usec (+- 2527.055 usec) Average test_bit loop took: 100758.743 usec (+- 833.865 usec) 100000 operations 128 bits set of 512 bits Average for_each_set_bit took: 23185.925 usec (+- 4532.910 usec) Average test_bit loop took: 103786.700 usec (+- 1475.276 usec) 100000 operations 256 bits set of 512 bits Average for_each_set_bit took: 40322.400 usec (+- 8341.802 usec) Average test_bit loop took: 109433.378 usec (+- 
2742.615 usec) 100000 operations 512 bits set of 512 bits Average for_each_set_bit took: 71804.540 usec (+- 15436.546 usec) Average test_bit loop took: 120255.440 usec (+- 5252.777 usec) 100000 operations 1 bits set of 1024 bits Average for_each_set_bit took: 1859.600 usec (+- 27.969 usec) Average test_bit loop took: 187676.000 usec (+- 1337.770 usec) 100000 operations 2 bits set of 1024 bits Average for_each_set_bit took: 2273.600 usec (+- 139.420 usec) Average test_bit loop took: 188176.000 usec (+- 684.357 usec) 100000 operations 4 bits set of 1024 bits Average for_each_set_bit took: 2940.400 usec (+- 268.213 usec) Average test_bit loop took: 189172.600 usec (+- 593.295 usec) 100000 operations 8 bits set of 1024 bits Average for_each_set_bit took: 4224.200 usec (+- 547.933 usec) Average test_bit loop took: 190257.250 usec (+- 621.021 usec) 100000 operations 16 bits set of 1024 bits Average for_each_set_bit took: 6090.560 usec (+- 877.975 usec) Average test_bit loop took: 190143.880 usec (+- 503.753 usec) 100000 operations 32 bits set of 1024 bits Average for_each_set_bit took: 9178.800 usec (+- 1475.136 usec) Average test_bit loop took: 190757.100 usec (+- 494.757 usec) 100000 operations 64 bits set of 1024 bits Average for_each_set_bit took: 14441.457 usec (+- 2545.497 usec) Average test_bit loop took: 192299.486 usec (+- 795.251 usec) 100000 operations 128 bits set of 1024 bits Average for_each_set_bit took: 23623.825 usec (+- 4481.182 usec) Average test_bit loop took: 194885.550 usec (+- 1300.817 usec) 100000 operations 256 bits set of 1024 bits Average for_each_set_bit took: 40194.956 usec (+- 8109.056 usec) Average test_bit loop took: 200259.311 usec (+- 2566.085 usec) 100000 operations 512 bits set of 1024 bits Average for_each_set_bit took: 70983.560 usec (+- 15074.982 usec) Average test_bit loop took: 210527.460 usec (+- 4968.980 usec) 100000 operations 1024 bits set of 1024 bits Average for_each_set_bit took: 136530.345 usec (+- 31584.400 usec) Average 
test_bit loop took: 233329.691 usec (+- 10814.036 usec) 100000 operations 1 bits set of 2048 bits Average for_each_set_bit took: 3077.600 usec (+- 76.376 usec) Average test_bit loop took: 402154.400 usec (+- 518.571 usec) 100000 operations 2 bits set of 2048 bits Average for_each_set_bit took: 3508.600 usec (+- 148.350 usec) Average test_bit loop took: 403814.500 usec (+- 1133.027 usec) 100000 operations 4 bits set of 2048 bits Average for_each_set_bit took: 4219.333 usec (+- 285.844 usec) Average test_bit loop took: 404312.533 usec (+- 985.751 usec) 100000 operations 8 bits set of 2048 bits Average for_each_set_bit took: 5670.550 usec (+- 615.238 usec) Average test_bit loop took: 405321.800 usec (+- 1038.487 usec) 100000 operations 16 bits set of 2048 bits Average for_each_set_bit took: 7785.080 usec (+- 992.522 usec) Average test_bit loop took: 406746.160 usec (+- 1015.478 usec) 100000 operations 32 bits set of 2048 bits Average for_each_set_bit took: 11163.800 usec (+- 1627.320 usec) Average test_bit loop took: 406124.267 usec (+- 898.785 usec) 100000 operations 64 bits set of 2048 bits Average for_each_set_bit took: 16964.629 usec (+- 2806.130 usec) Average test_bit loop took: 406618.514 usec (+- 798.356 usec) 100000 operations 128 bits set of 2048 bits Average for_each_set_bit took: 27219.625 usec (+- 4988.458 usec) Average test_bit loop took: 410149.325 usec (+- 1705.641 usec) 100000 operations 256 bits set of 2048 bits Average for_each_set_bit took: 45138.578 usec (+- 8831.021 usec) Average test_bit loop took: 415462.467 usec (+- 2725.418 usec) 100000 operations 512 bits set of 2048 bits Average for_each_set_bit took: 77450.540 usec (+- 15962.238 usec) Average test_bit loop took: 426089.180 usec (+- 5171.788 usec) 100000 operations 1024 bits set of 2048 bits Average for_each_set_bit took: 138023.636 usec (+- 29826.959 usec) Average test_bit loop took: 446346.636 usec (+- 9904.417 usec) 100000 operations 2048 bits set of 2048 bits Average for_each_set_bit 
took: 251072.600 usec (+- 55947.692 usec) Average test_bit loop took: 484855.983 usec (+- 18970.431 usec) # Signed-off-by: Ian Rogers <irogers@google.com> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lore.kernel.org/lkml/20200729220034.1337168-1-irogers@google.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2020-07-28perf bench: Add basic syscall benchmarkDavidlohr Bueso
The usefulness of having a standard way of testing syscall performance has come up from time to time[0]. Furthermore, some of our testing machinery (such as 'mmtests') already makes use of a simplified version of the microbenchmark. This patch mainly takes the same idea to measure syscall throughput compatible with 'perf-bench' via getppid(2), yet without any of the additional template stuff from Ingo's version (based on numa.c). The code is identical to what mmtests uses. [0] https://lore.kernel.org/lkml/20160201074156.GA27156@gmail.com/ Committer notes: Add missing stdlib.h and unistd.h to get the prototypes for exit() and getppid(). Committer testing: $ perf bench Usage: perf bench [<common options>] <collection> <benchmark> [<options>] # List of all available benchmark collections: sched: Scheduler and IPC benchmarks syscall: System call benchmarks mem: Memory access benchmarks numa: NUMA scheduling and MM benchmarks futex: Futex stressing benchmarks epoll: Epoll stressing benchmarks internals: Perf-internals benchmarks all: All benchmarks $ $ perf bench syscall # List of available benchmarks for collection 'syscall': basic: Benchmark for basic getppid(2) calls all: Run all syscall benchmarks $ perf bench syscall basic # Running 'syscall/basic' benchmark: # Executed 10000000 getppid() calls Total time: 3.679 [sec] 0.367957 usecs/op 2717708 ops/sec $ perf bench syscall all # Running syscall/basic benchmark... # Executed 10000000 getppid() calls Total time: 3.644 [sec] 0.364456 usecs/op 2743815 ops/sec $ Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Acked-by: Josh Poimboeuf <jpoimboe@redhat.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Link: http://lore.kernel.org/lkml/20190308181747.l36zqz2avtivrr3c@linux-r8p5 Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2020-05-28perf tools: Replace zero-length array with flexible-arrayGustavo A. R. Silva
The current codebase makes use of the zero-length array language extension to the C90 standard, but the preferred mechanism to declare variable-length types such as these ones is a flexible array member[1][2], introduced in C99: struct foo { int stuff; struct boo array[]; }; By making use of the mechanism above, we will get a compiler warning in case the flexible array does not occur last in the structure, which will help us prevent some kind of undefined behavior bugs from being inadvertently introduced[3] to the codebase from now on. Also, notice that dynamic memory allocations won't be affected by this change: "Flexible array members have incomplete type, and so the sizeof operator may not be applied. As a quirk of the original implementation of zero-length arrays, sizeof evaluates to zero."[1] sizeof(flexible-array-member) triggers a warning because flexible array members have incomplete type[1]. There are some instances of code in which the sizeof operator is being incorrectly/erroneously applied to zero-length arrays and the result is zero. Such instances may be hiding some bugs. So, this work (flexible-array member conversions) will also help to get rid of those sorts of issues completely. This issue was found with the help of Coccinelle. [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html [2] https://github.com/KSPP/linux/issues/21 [3] commit 76497732932f ("cxgb3/l2t: Fix undefined behaviour") Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Gustavo A. R. Silva <gustavo@embeddedor.com> Cc: Ian Rogers <irogers@google.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lore.kernel.org/lkml/20200515172926.GA31976@embeddedor Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2020-05-28tools feature: Rename HAVE_EVENTFD to HAVE_EVENTFD_SUPPORTArnaldo Carvalho de Melo
To be consistent with other such auto-detected features. Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Anand K Mistry <amistry@google.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2020-05-05perf bench: Add kallsyms parsingIan Rogers
Add a benchmark for kallsyms parsing. Example output: Running 'internals/kallsyms-parse' benchmark: Average kallsyms__parse took: 103.971 ms (+- 0.121 ms) Committer testing: Test Machine: AMD Ryzen 5 3600X 6-Core Processor [root@five ~]# perf bench internals kallsyms-parse # Running 'internals/kallsyms-parse' benchmark: Average kallsyms__parse took: 79.692 ms (+- 0.101 ms) [root@five ~]# perf stat -r5 perf bench internals kallsyms-parse # Running 'internals/kallsyms-parse' benchmark: Average kallsyms__parse took: 80.563 ms (+- 0.079 ms) # Running 'internals/kallsyms-parse' benchmark: Average kallsyms__parse took: 81.046 ms (+- 0.155 ms) # Running 'internals/kallsyms-parse' benchmark: Average kallsyms__parse took: 80.874 ms (+- 0.104 ms) # Running 'internals/kallsyms-parse' benchmark: Average kallsyms__parse took: 81.173 ms (+- 0.133 ms) # Running 'internals/kallsyms-parse' benchmark: Average kallsyms__parse took: 81.169 ms (+- 0.074 ms) Performance counter stats for 'perf bench internals kallsyms-parse' (5 runs): 8,093.54 msec task-clock # 0.999 CPUs utilized ( +- 0.14% ) 3,165 context-switches # 0.391 K/sec ( +- 0.18% ) 10 cpu-migrations # 0.001 K/sec ( +- 23.13% ) 744 page-faults # 0.092 K/sec ( +- 0.21% ) 34,551,564,954 cycles # 4.269 GHz ( +- 0.05% ) (83.33%) 1,160,584,308 stalled-cycles-frontend # 3.36% frontend cycles idle ( +- 1.60% ) (83.33%) 14,974,323,985 stalled-cycles-backend # 43.34% backend cycles idle ( +- 0.24% ) (83.33%) 58,712,905,705 instructions # 1.70 insn per cycle # 0.26 stalled cycles per insn ( +- 0.01% ) (83.34%) 14,136,433,778 branches # 1746.632 M/sec ( +- 0.01% ) (83.33%) 141,943,217 branch-misses # 1.00% of all branches ( +- 0.04% ) (83.33%) 8.1040 +- 0.0115 seconds time elapsed ( +- 0.14% ) [root@five ~]# Signed-off-by: Ian Rogers <irogers@google.com> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Mark Rutland <mark.rutland@arm.com> 
Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lore.kernel.org/lkml/20200501221315.54715-2-irogers@google.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2020-04-30perf bench: Add a multi-threaded synthesize benchmarkIan Rogers
By default this isn't run as it reads /proc and may not have access. For consistency, modify the single threaded benchmark to compute an average time per event. Committer testing: $ grep -m1 "model name" /proc/cpuinfo model name : Intel(R) Core(TM) i7-8650U CPU @ 1.90GHz $ grep "model name" /proc/cpuinfo | wc -l 8 $ $ perf bench internals synthesize -h # Running 'internals/synthesize' benchmark: Usage: perf bench internals synthesize <options> -I, --multi-iterations <n> Number of iterations used to compute multi-threaded average -i, --single-iterations <n> Number of iterations used to compute single-threaded average -M, --max-threads <n> Maximum number of threads in multithreaded bench -m, --min-threads <n> Minimum number of threads in multithreaded bench -s, --st Run single threaded benchmark -t, --mt Run multi-threaded benchmark $ $ perf bench internals synthesize -t # Running 'internals/synthesize' benchmark: Computing performance of multi threaded perf event synthesis by synthesizing events on CPU 0: Number of synthesis threads: 1 Average synthesis took: 65449.000 usec (+- 586.442 usec) Average num. events: 9405.400 (+- 0.306) Average time per event 6.959 usec Number of synthesis threads: 2 Average synthesis took: 37838.300 usec (+- 130.259 usec) Average num. events: 9501.800 (+- 20.469) Average time per event 3.982 usec Number of synthesis threads: 3 Average synthesis took: 48551.400 usec (+- 225.686 usec) Average num. events: 9544.000 (+- 0.000) Average time per event 5.087 usec Number of synthesis threads: 4 Average synthesis took: 29632.500 usec (+- 50.808 usec) Average num. events: 9544.000 (+- 0.000) Average time per event 3.105 usec Number of synthesis threads: 5 Average synthesis took: 33920.400 usec (+- 284.509 usec) Average num. events: 9544.000 (+- 0.000) Average time per event 3.554 usec Number of synthesis threads: 6 Average synthesis took: 27604.100 usec (+- 72.344 usec) Average num. 
events: 9548.000 (+- 0.000) Average time per event 2.891 usec Number of synthesis threads: 7 Average synthesis took: 25406.300 usec (+- 933.371 usec) Average num. events: 9545.500 (+- 0.167) Average time per event 2.662 usec Number of synthesis threads: 8 Average synthesis took: 24110.400 usec (+- 73.229 usec) Average num. events: 9551.000 (+- 0.000) Average time per event 2.524 usec $ Signed-off-by: Ian Rogers <irogers@google.com> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Acked-by: Jiri Olsa <jolsa@redhat.com> Acked-by: Namhyung Kim <namhyung@kernel.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andrey Zhizhikin <andrey.z@gmail.com> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lore.kernel.org/lkml/20200415054050.31645-2-irogers@google.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2020-04-22perf bench: Fix div-by-zero if runtime is zeroTommi Rantala
Fix div-by-zero if runtime is zero: $ perf bench futex hash --runtime=0 # Running 'futex/hash' benchmark: Run summary [PID 12090]: 4 threads, each operating on 1024 [private] futexes for 0 secs. Floating point exception (core dumped) Signed-off-by: Tommi Rantala <tommi.t.rantala@nokia.com> Acked-by: Jiri Olsa <jolsa@redhat.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Darren Hart <dvhart@infradead.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lore.kernel.org/lkml/20200417132330.119407-4-tommi.t.rantala@nokia.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
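A minimal sketch of the kind of guard this fix implies, not the actual patch: the helper name ops_per_sec() is hypothetical, and it only shows how a zero runtime (as with --runtime=0) can be handled without dividing by it.

```c
#include <stdint.h>

/* Hypothetical helper: average operations per second. A zero runtime
 * (e.g. --runtime=0) yields 0 instead of triggering a division by zero. */
static uint64_t ops_per_sec(uint64_t ops, uint64_t runtime_sec)
{
	return runtime_sec ? ops / runtime_sec : 0;
}
```

Returning 0 is one possible policy; rejecting a zero runtime at option parsing time would be another.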
2020-04-16perf bench: Add event synthesis benchmarkIan Rogers
Event synthesis may occur at the start or end (tail) of a perf command. In system-wide mode it can scan every process in /proc, which may add seconds of latency before event recording. Add a new benchmark that times how long event synthesis takes with and without data synthesis. An example execution looks like: $ perf bench internals synthesize # Running 'internals/synthesize' benchmark: Average synthesis took: 168.253800 usec Average data synthesis took: 208.104700 usec Signed-off-by: Ian Rogers <irogers@google.com> Acked-by: Jiri Olsa <jolsa@redhat.com> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andrey Zhizhikin <andrey.z@gmail.com> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lore.kernel.org/lkml/20200402154357.107873-2-irogers@google.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2020-03-06perf bench: Clear struct sigaction before sigaction() syscallTommi Rantala
Avoid garbage in sigaction structs used in sigaction() syscalls. Valgrind is complaining about it. Signed-off-by: Tommi Rantala <tommi.t.rantala@nokia.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Changbin Du <changbin.du@intel.com> Cc: Darren Hart <dvhart@infradead.org> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lore.kernel.org/lkml/20200305083714.9381-4-tommi.t.rantala@nokia.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2020-03-06perf bench futex-wake: Restore thread count default to online CPU countTommi Rantala
Since commit 3b2323c2c1c4 ("perf bench futex: Use cpumaps") the default number of threads the benchmark uses got changed from number of online CPUs to zero: $ perf bench futex wake # Running 'futex/wake' benchmark: Run summary [PID 15930]: blocking on 0 threads (at [private] futex 0x558b8ee4bfac), waking up 1 at a time. [Run 1]: Wokeup 0 of 0 threads in 0.0000 ms [...] [Run 10]: Wokeup 0 of 0 threads in 0.0000 ms Wokeup 0 of 0 threads in 0.0004 ms (+-40.82%) Restore the old behavior by grabbing the number of online CPUs via cpu->nr: $ perf bench futex wake # Running 'futex/wake' benchmark: Run summary [PID 18356]: blocking on 8 threads (at [private] futex 0xb3e62c), waking up 1 at a time. [Run 1]: Wokeup 8 of 8 threads in 0.0260 ms [...] [Run 10]: Wokeup 8 of 8 threads in 0.0270 ms Wokeup 8 of 8 threads in 0.0419 ms (+-24.35%) Fixes: 3b2323c2c1c4 ("perf bench futex: Use cpumaps") Signed-off-by: Tommi Rantala <tommi.t.rantala@nokia.com> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Darren Hart <dvhart@infradead.org> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lore.kernel.org/lkml/20200305083714.9381-3-tommi.t.rantala@nokia.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
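A rough sketch of the "default to the number of online CPUs" behavior. Note the swap: the commit takes the count from perf's cpumap (cpu->nr), while this standalone example uses sysconf(_SC_NPROCESSORS_ONLN) instead, and default_nthreads() is a hypothetical name.

```c
#include <unistd.h>

/* Hypothetical helper: keep an explicit thread count, otherwise fall
 * back to the number of online CPUs (at least 1). */
static unsigned int default_nthreads(unsigned int requested)
{
	long ncpus;

	if (requested)
		return requested;

	ncpus = sysconf(_SC_NPROCESSORS_ONLN);
	return ncpus > 0 ? (unsigned int)ncpus : 1;
}
```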
2020-03-03perf bench: Share some global variables to fix build with gcc 10Arnaldo Carvalho de Melo
Noticed with gcc 10 (fedora rawhide) that those variables were not being declared as static, so end up with: ld: /tmp/build/perf/bench/epoll-wait.o:/git/perf/tools/perf/bench/epoll-wait.c:93: multiple definition of `end'; /tmp/build/perf/bench/futex-hash.o:/git/perf/tools/perf/bench/futex-hash.c:40: first defined here ld: /tmp/build/perf/bench/epoll-wait.o:/git/perf/tools/perf/bench/epoll-wait.c:93: multiple definition of `start'; /tmp/build/perf/bench/futex-hash.o:/git/perf/tools/perf/bench/futex-hash.c:40: first defined here ld: /tmp/build/perf/bench/epoll-wait.o:/git/perf/tools/perf/bench/epoll-wait.c:93: multiple definition of `runtime'; /tmp/build/perf/bench/futex-hash.o:/git/perf/tools/perf/bench/futex-hash.c:40: first defined here ld: /tmp/build/perf/bench/epoll-ctl.o:/git/perf/tools/perf/bench/epoll-ctl.c:38: multiple definition of `end'; /tmp/build/perf/bench/futex-hash.o:/git/perf/tools/perf/bench/futex-hash.c:40: first defined here ld: /tmp/build/perf/bench/epoll-ctl.o:/git/perf/tools/perf/bench/epoll-ctl.c:38: multiple definition of `start'; /tmp/build/perf/bench/futex-hash.o:/git/perf/tools/perf/bench/futex-hash.c:40: first defined here ld: /tmp/build/perf/bench/epoll-ctl.o:/git/perf/tools/perf/bench/epoll-ctl.c:38: multiple definition of `runtime'; /tmp/build/perf/bench/futex-hash.o:/git/perf/tools/perf/bench/futex-hash.c:40: first defined here make[4]: *** [/git/perf/tools/build/Makefile.build:145: /tmp/build/perf/bench/perf-in.o] Error 1 Prefix those with bench__ and add them to bench/bench.h, so that we can share those on the tools needing to access those variables from signal handlers. Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Link: http://lore.kernel.org/lkml/20200303155811.GD13702@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2019-09-20perf env: Remove needless cpumap.h headerArnaldo Carvalho de Melo
Only a 'struct perf_cpu_map' forward declaration is necessary, fix the places that need the header but were getting it indirectly, by luck, from env.h. Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Link: https://lkml.kernel.org/n/tip-3sj3n534zghxhk7ygzeaqlx9@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2019-09-20perf tools: Remove util.h from where it is not neededArnaldo Carvalho de Melo
Check that it is not needed and remove, fixing up some fallout for places where it was only serving to get something else. Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Link: https://lkml.kernel.org/n/tip-9h6dg6lsqe2usyqjh5rrues4@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2019-09-20perf tools: Remove needless builtin.h include directivesArnaldo Carvalho de Melo
Now that builtin.h isn't included by any other header, we can check where it is really needed, i.e. we can remove it and be sure that it isn't being obtained indirectly. Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Link: https://lkml.kernel.org/n/tip-mn7jheex85iw9qo6tlv26hb2@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2019-08-29perf tools: Remove perf.h from source files not needing itArnaldo Carvalho de Melo
With the movement of lots of stuff out of perf.h to other headers we ended up not needing it in lots of places, remove it from those places. Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Link: https://lkml.kernel.org/n/tip-c718m0sxxwp73lp9d8vpihb4@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2019-08-29perf tools: Move everything related to sys_perf_event_open() to perf-sys.hArnaldo Carvalho de Melo
And remove unneeded include directives from perf-sys.h to prune the header dependency tree. Fixup the fallout in places where definitions were being used without the needed include directives that were being satisfied because they were in perf-sys.h. Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Link: https://lkml.kernel.org/n/tip-7b1zvugiwak4ibfa3j6ott7f@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2019-08-12Merge remote-tracking branch 'torvalds/master' into perf/coreArnaldo Carvalho de Melo
To get closer to upstream and check if we need to sync more UAPI headers, pick up fixes for libbpf that prevent perf's container tests from completing successfully, etc. Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2019-08-01perf bench numa: Fix cpu0 bindingJiri Olsa
Michael reported an issue with perf bench numa failing with binding to cpu0 with '-0' option. # perf bench numa mem -p 3 -t 1 -P 512 -s 100 -zZcm0 --thp 1 -M 1 -ddd # Running 'numa/mem' benchmark: # Running main, "perf bench numa numa-mem -p 3 -t 1 -P 512 -s 100 -zZcm0 --thp 1 -M 1 -ddd" binding to node 0, mask: 0000000000000001 => -1 perf: bench/numa.c:356: bind_to_memnode: Assertion `!(ret)' failed. Aborted (core dumped) This happens when the cpu0 is not part of node0, which is the benchmark assumption and we can see that's not the case for some powerpc servers. Using correct node for cpu0 binding. Reported-by: Michael Petlan <mpetlan@redhat.com> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Satheesh Rajendran <sathnaga@linux.vnet.ibm.com> Link: http://lkml.kernel.org/r/20190801142642.28004-1-jolsa@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2019-07-29libperf: Add perf_cpu_map__new()/perf_cpu_map__read() functionsJiri Olsa
Moving the following functions from tools/perf: cpu_map__new() cpu_map__read() to libperf with the following names: perf_cpu_map__new() perf_cpu_map__read() Committer notes: Fixed up this one: tools/perf/arch/arm/util/cs-etm.c Signed-off-by: Jiri Olsa <jolsa@kernel.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Alexey Budankov <alexey.budankov@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Michael Petlan <mpetlan@redhat.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20190721112506.12306-44-jolsa@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2019-07-29perf cpu_map: Rename struct cpu_map to struct perf_cpu_mapJiri Olsa
Rename struct cpu_map to struct perf_cpu_map, so it could be part of libperf. Committer notes: Added fixes for arm64, provided by Jiri. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Alexey Budankov <alexey.budankov@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Michael Petlan <mpetlan@redhat.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20190721112506.12306-3-jolsa@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2019-07-09perf tools: Use zfree() where applicableArnaldo Carvalho de Melo
In places where the equivalent was already being done, i.e.: free(a); a = NULL; And in places where struct members are being freed so that if we have some erroneous reference to its struct, then accesses to freed members will result in segfaults, which we can detect faster than use after free to areas that may still have something seemingly valid. Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Link: https://lkml.kernel.org/n/tip-jatyoofo5boc1bsvoig6bb6i@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
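The free-then-NULL idiom can be sketched as follows. This is a hypothetical stand-in, not the real zfree() from the tools tree: my_zfree(), struct sample and sample__exit() are names invented for the example.

```c
#include <stdlib.h>

/* Hypothetical stand-in for zfree(): free the pointed-to allocation and
 * reset the caller's pointer to NULL in a single step. */
#define my_zfree(pp) do { free(*(pp)); *(pp) = NULL; } while (0)

struct sample {
	char *name;
};

/* Release a member so that a stale access hits NULL and faults early,
 * rather than reading memory that was already freed. */
static void sample__exit(struct sample *s)
{
	my_zfree(&s->name);
}
```

Because free(NULL) is a no-op, calling the exit function twice stays safe.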
2019-07-09tools lib: Adopt zalloc()/zfree() from tools/perfArnaldo Carvalho de Melo
Eroding a bit more the tools/perf/util/util.h hodgepodge header. Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Link: https://lkml.kernel.org/n/tip-natazosyn9rwjka25tvcnyi0@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2019-05-02perf bench numa: Add define for RUSAGE_THREAD if not presentArnaldo Carvalho de Melo
While cross building perf to the ARC architecture on a fedora 30 host, we were failing with: CC /tmp/build/perf/bench/numa.o bench/numa.c: In function ‘worker_thread’: bench/numa.c:1261:12: error: ‘RUSAGE_THREAD’ undeclared (first use in this function); did you mean ‘SIGEV_THREAD’? getrusage(RUSAGE_THREAD, &rusage); ^~~~~~~~~~~~~ SIGEV_THREAD bench/numa.c:1261:12: note: each undeclared identifier is reported only once for each function it appears in [perfbuilder@60d5802468f6 perf]$ /arc_gnu_2019.03-rc1_prebuilt_uclibc_le_archs_linux_install/bin/arc-linux-gcc --version | head -1 arc-linux-gcc (ARCv2 ISA Linux uClibc toolchain 2019.03-rc1) 8.3.1 20190225 [perfbuilder@60d5802468f6 perf]$ Trying to reproduce a report by Vineet, I noticed that, with just cross-built zlib and numactl libraries, I ended up with the above failure. So, since RUSAGE_THREAD is available as a define in the system headers, check if it is defined in the 'perf bench numa' sources and define it if not. Now it builds and I have to figure out if the problem reported by Vineet only takes place if we have libelf or some other library available. Cc: Arnd Bergmann <arnd@arndb.de> Cc: Jiri Olsa <jolsa@kernel.org> Cc: linux-snps-arc@lists.infradead.org Cc: Namhyung Kim <namhyung@kernel.org> Cc: Vineet Gupta <Vineet.Gupta1@synopsys.com> Link: https://lkml.kernel.org/n/tip-2wb4r1gir9xrevbpq7qp0amk@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
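The "define it if not present" approach looks roughly like this. The value 1 is the Linux value of RUSAGE_THREAD; the helper thread_utime_usec() is hypothetical and only there to exercise the define.

```c
#include <sys/resource.h>

/* Some libcs do not expose RUSAGE_THREAD; on Linux its value is 1. */
#ifndef RUSAGE_THREAD
#define RUSAGE_THREAD 1
#endif

/* Hypothetical helper: user CPU time consumed by the calling thread,
 * in microseconds, or -1 on failure. */
static long long thread_utime_usec(void)
{
	struct rusage ru;

	if (getrusage(RUSAGE_THREAD, &ru) != 0)
		return -1;
	return (long long)ru.ru_utime.tv_sec * 1000000LL + ru.ru_utime.tv_usec;
}
```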
2019-03-22Merge tag 'perf-core-for-mingo-5.1-20190321' of ↵Thomas Gleixner
git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/urgent Pull perf/core improvements and fixes from Arnaldo: BPF: Song Liu: - Add support for annotating BPF programs, using the PERF_RECORD_BPF_EVENT and PERF_RECORD_KSYMBOL recently added to the kernel and plugging binutils's libopcodes disassembly of BPF programs with the existing annotation interfaces in 'perf annotate', 'perf report' and 'perf top' various output formats (--stdio, --stdio2, --tui). perf list: Andi Kleen: - Filter metrics when using substring search. perf record: Andi Kleen: - Allow to limit number of reported perf.data files - Clarify help for --switch-output. perf report: Andi Kleen - Indicate JITed code better. - Show all sort keys in help output. perf script: Andi Kleen: - Support relative time. perf stat: Andi Kleen: - Improve scaling. General: Changbin Du: - Fix some mostly error path memory and reference count leaks found using gcc's ASan and UBSan. Vendor events: Mamatha Inamdar: - Remove P8 HW events which are not supported. Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2019-03-19perf tools: Fix errors under optimization level '-Og'Changbin Du
Optimization level '-Og' offers a reasonable level of optimization while maintaining fast compilation and a good debugging experience. This patch tries to make it work. $ make DEBUG=1 EXTRA_CFLAGS='-Og' bench/epoll-ctl.c: In function ‘do_threads’: bench/epoll-ctl.c:274:9: error: ‘ret’ may be used uninitialized in this function [-Werror=maybe-uninitialized] return ret; ^~~ ... Signed-off-by: Changbin Du <changbin.du@gmail.com> Reviewed-by: Jiri Olsa <jolsa@kernel.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt (VMware) <rostedt@goodmis.org> Link: http://lkml.kernel.org/r/20190316080556.3075-4-changbin.du@gmail.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2019-03-05tools/: replace open encodings for NUMA_NO_NODEStephen Rothwell
This replaces all open encodings in tools with NUMA_NO_NODE. Also linux/numa.h is now needed for the perf build. [sfr@canb.auug.org.au: fix for replace open encodings for NUMA_NO_NODE] Link: http://lkml.kernel.org/r/20190108131141.730e9c4f@canb.auug.org.au Link: http://lkml.kernel.org/r/1545127933-10711-3-git-send-email-anshuman.khandual@arm.com Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Cc: David Hildenbrand <david@redhat.com> Cc: Doug Ledford <dledford@redhat.com> [drivers/infiniband] Cc: Hans Verkuil <hverkuil@xs4all.nl> Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com> [ixgbe] Cc: Jens Axboe <axboe@kernel.dk> [mtip32xx] Cc: Joseph Qi <jiangqi903@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> [powerpc] Cc: Vinod Koul <vkoul@kernel.org> [dmaengine.c] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-11-21perf bench: Add epoll_ctl(2) benchmarkDavidlohr Bueso
Benchmark the various operations allowed for epoll_ctl(2). The idea is to concurrently stress a single epoll instance doing add/mod/del operations. Committer testing: # perf bench epoll ctl # Running 'epoll/ctl' benchmark: Run summary [PID 20344]: 4 threads doing epoll_ctl ops 64 file-descriptors for 8 secs. [thread 0] fdmap: 0x21a46b0 ... 0x21a47ac [ add: 1680960 ops; mod: 1680960 ops; del: 1680960 ops ] [thread 1] fdmap: 0x21a4960 ... 0x21a4a5c [ add: 1685440 ops; mod: 1685440 ops; del: 1685440 ops ] [thread 2] fdmap: 0x21a4c10 ... 0x21a4d0c [ add: 1674368 ops; mod: 1674368 ops; del: 1674368 ops ] [thread 3] fdmap: 0x21a4ec0 ... 0x21a4fbc [ add: 1677568 ops; mod: 1677568 ops; del: 1677568 ops ] Averaged 1679584 ADD operations (+- 0.14%) Averaged 1679584 MOD operations (+- 0.14%) Averaged 1679584 DEL operations (+- 0.14%) # Let's measure those calls with 'perf trace' to get a glimpse at what this benchmark is doing in terms of syscalls: # perf trace -m32768 -s perf bench epoll ctl # Running 'epoll/ctl' benchmark: Run summary [PID 20405]: 4 threads doing epoll_ctl ops 64 file-descriptors for 8 secs. [thread 0] fdmap: 0x21764e0 ... 0x21765dc [ add: 1100480 ops; mod: 1100480 ops; del: 1100480 ops ] [thread 1] fdmap: 0x2176790 ... 0x217688c [ add: 1250176 ops; mod: 1250176 ops; del: 1250176 ops ] [thread 2] fdmap: 0x2176a40 ... 0x2176b3c [ add: 1022464 ops; mod: 1022464 ops; del: 1022464 ops ] [thread 3] fdmap: 0x2176cf0 ... 
0x2176dec [ add: 705472 ops; mod: 705472 ops; del: 705472 ops ] Averaged 1019648 ADD operations (+- 11.27%) Averaged 1019648 MOD operations (+- 11.27%) Averaged 1019648 DEL operations (+- 11.27%) Summary of events: epoll-ctl (20405), 1264 events, 0.0% syscall calls total min avg max stddev (msec) (msec) (msec) (msec) (%) --------------- -------- --------- --------- --------- --------- ------ eventfd2 256 9.514 0.001 0.037 5.243 68.00% clone 4 1.245 0.204 0.311 0.531 24.13% mprotect 66 0.345 0.002 0.005 0.021 7.43% openat 45 0.313 0.004 0.007 0.073 21.93% mmap 88 0.302 0.002 0.003 0.013 5.02% futex 4 0.160 0.002 0.040 0.140 83.43% sched_setaffinity 4 0.124 0.005 0.031 0.070 49.39% read 44 0.103 0.001 0.002 0.013 15.54% fstat 40 0.052 0.001 0.001 0.003 5.43% close 39 0.039 0.001 0.001 0.001 1.48% stat 9 0.034 0.003 0.004 0.006 7.30% access 3 0.023 0.007 0.008 0.008 4.25% open 2 0.021 0.008 0.011 0.013 22.60% getdents 4 0.019 0.001 0.005 0.009 37.15% write 2 0.013 0.004 0.007 0.009 38.48% munmap 1 0.010 0.010 0.010 0.010 0.00% brk 3 0.006 0.001 0.002 0.003 26.34% rt_sigprocmask 2 0.004 0.001 0.002 0.003 43.95% rt_sigaction 3 0.004 0.001 0.001 0.002 16.07% prlimit64 3 0.004 0.001 0.001 0.001 5.39% prctl 1 0.003 0.003 0.003 0.003 0.00% epoll_create 1 0.003 0.003 0.003 0.003 0.00% lseek 2 0.002 0.001 0.001 0.001 11.42% sched_getaffinity 1 0.002 0.002 0.002 0.002 0.00% arch_prctl 1 0.002 0.002 0.002 0.002 0.00% set_tid_address 1 0.001 0.001 0.001 0.001 0.00% getpid 1 0.001 0.001 0.001 0.001 0.00% set_robust_list 1 0.001 0.001 0.001 0.001 0.00% execve 1 0.000 0.000 0.000 0.000 0.00% epoll-ctl (20406), 1245480 events, 14.6% syscall calls total min avg max stddev (msec) (msec) (msec) (msec) (%) --------------- -------- --------- --------- --------- --------- ------ epoll_ctl 619511 1034.927 0.001 0.002 6.691 0.67% nanosleep 3226 616.114 0.006 0.191 10.376 7.57% futex 2 11.336 0.002 5.668 11.334 99.97% set_robust_list 1 0.001 0.001 0.001 0.001 0.00% clone 1 0.000 0.000 0.000 
0.000 0.00% epoll-ctl (20407), 1243151 events, 14.5% syscall calls total min avg max stddev (msec) (msec) (msec) (msec) (%) --------------- -------- --------- --------- --------- --------- ------ epoll_ctl 618350 1042.181 0.001 0.002 2.512 0.40% nanosleep 3220 366.261 0.012 0.114 18.162 9.59% futex 4 5.463 0.001 1.366 5.427 99.12% set_robust_list 1 0.002 0.002 0.002 0.002 0.00% epoll-ctl (20408), 1801690 events, 21.1% syscall calls total min avg max stddev (msec) (msec) (msec) (msec) (%) --------------- -------- --------- --------- --------- --------- ------ epoll_ctl 896174 1540.581 0.001 0.002 6.987 0.74% nanosleep 4667 783.393 0.006 0.168 10.419 7.10% futex 2 4.682 0.002 2.341 4.681 99.93% set_robust_list 1 0.002 0.002 0.002 0.002 0.00% clone 1 0.000 0.000 0.000 0.000 0.00% epoll-ctl (20409), 4254890 events, 49.8% syscall calls total min avg max stddev (msec) (msec) (msec) (msec) (%) --------------- -------- --------- --------- --------- --------- ------ epoll_ctl 2116416 3768.097 0.001 0.002 9.956 0.41% nanosleep 11023 1141.778 0.006 0.104 9.447 4.95% futex 3 0.037 0.002 0.012 0.029 70.50% set_robust_list 1 0.008 0.008 0.008 0.008 0.00% madvise 1 0.005 0.005 0.005 0.005 0.00% clone 1 0.000 0.000 0.000 0.000 0.00% # Committer notes: Fix build on fedora:24-x-ARC-uClibc, debian:experimental-x-mips, debian:experimental-x-mipsel, ubuntu:16.04-x-arm and ubuntu:16.04-x-powerpc CC /tmp/build/perf/bench/epoll-ctl.o bench/epoll-ctl.c: In function 'init_fdmaps': bench/epoll-ctl.c:214:16: error: comparison between signed and unsigned integer expressions [-Werror=sign-compare] for (i = 0; i < nfds; i+=inc) { ^ bench/epoll-ctl.c: In function 'bench_epoll_ctl': bench/epoll-ctl.c:377:16: error: comparison between signed and unsigned integer expressions [-Werror=sign-compare] for (i = 0; i < nthreads; i++) { ^ bench/epoll-ctl.c:388:16: error: comparison between signed and unsigned integer expressions [-Werror=sign-compare] for (i = 0; i < nthreads; i++) { ^ cc1: all warnings 
being treated as errors Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Davidlohr Bueso <dbueso@suse.de> Cc: Jason Baron <jbaron@akamai.com> Link: http://lkml.kernel.org/r/20181106152226.20883-3-dave@stgolabs.net [ Use inttypes.h to print rlim_t fields, fixing the build on Alpine Linux / musl libc ] [ Check if eventfd() is available, i.e. if HAVE_EVENTFD is defined ] Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
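The -Werror=sign-compare failures in the committer notes come from comparing a signed loop index with an unsigned bound. How the commit resolved them is not shown here; one common approach, sketched with a hypothetical sum_ops() helper, is to give the index the same signedness as the bound.

```c
#include <stddef.h>

/* Hypothetical example: sum per-thread counters. The unsigned index
 * matches the unsigned bound, so -Wsign-compare has nothing to flag. */
static unsigned long sum_ops(const unsigned long *ops, unsigned int nthreads)
{
	unsigned long total = 0;
	unsigned int i;

	for (i = 0; i < nthreads; i++)
		total += ops[i];
	return total;
}
```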
2018-11-21perf bench: Add epoll parallel epoll_wait benchmarkDavidlohr Bueso
This program benchmarks concurrent epoll_wait(2) for file descriptors that are monitored with EPOLLIN along various semantics, by a single epoll instance. Such conditions can be found when using single/combined or multiple queuing when load balancing. Each thread has a number of private, nonblocking file descriptors, referred to as fdmap. A writer thread will constantly be writing to the fdmaps of all threads, minimizing each thread's chances of epoll_wait not finding any ready read events and blocking as this is not what we want to stress. Full details in the start of the C file. Committer testing: # perf bench Usage: perf bench [<common options>] <collection> <benchmark> [<options>] # List of all available benchmark collections: sched: Scheduler and IPC benchmarks mem: Memory access benchmarks numa: NUMA scheduling and MM benchmarks futex: Futex stressing benchmarks epoll: Epoll stressing benchmarks all: All benchmarks # perf bench epoll # List of available benchmarks for collection 'epoll': wait: Benchmark epoll concurrent epoll_waits all: Run all futex benchmarks # perf bench epoll wait # Running 'epoll/wait' benchmark: Run summary [PID 19295]: 3 threads monitoring on 64 file-descriptors for 8 secs. [thread 0] fdmap: 0xdaa650 ... 0xdaa74c [ 328241 ops/sec ] [thread 1] fdmap: 0xdaa900 ... 0xdaa9fc [ 351695 ops/sec ] [thread 2] fdmap: 0xdaabb0 ... 
0xdaacac [ 381423 ops/sec ] Averaged 353786 operations/sec (+- 4.35%), total secs = 8 # Committer notes: Fix the build on debian:experimental-x-mips, debian:experimental-x-mipsel and others: CC /tmp/build/perf/bench/epoll-wait.o bench/epoll-wait.c: In function 'writerfn': bench/epoll-wait.c:399:12: error: format '%ld' expects argument of type 'long int', but argument 2 has type 'size_t' {aka 'unsigned int'} [-Werror=format=] printinfo("exiting writer-thread (total full-loops: %ld)\n", iter); ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~ bench/epoll-wait.c:86:31: note: in definition of macro 'printinfo' do { if (__verbose) { printf(fmt, ## arg); fflush(stdout); } } while (0) ^~~ cc1: all warnings being treated as errors Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Davidlohr Bueso <dbueso@suse.de> Cc: Jason Baron <jbaron@akamai.com> Link: http://lkml.kernel.org/r/20181106152226.20883-2-dave@stgolabs.net Link: http://lkml.kernel.org/r/20181106182349.thdkpvshkna5vd7o@linux-r8p5 [ Applied above fixup as per Davidlohr's request ] [ Use inttypes.h to print rlim_t fields, fixing the build on Alpine Linux / musl libc ] [ Check if eventfd() is available, i.e. if HAVE_EVENTFD is defined ] Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
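The format error above arises because size_t is 'unsigned int' on 32-bit mips while '%ld' expects a long. The exact fixup applied is not quoted in this log; a portable option, sketched here with a hypothetical format_iterations() helper, is the C99 'z' length modifier.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper reusing the message from the build error: %zu is
 * the portable conversion for size_t on both 32- and 64-bit targets. */
static int format_iterations(char *buf, size_t len, size_t iter)
{
	return snprintf(buf, len,
			"exiting writer-thread (total full-loops: %zu)\n", iter);
}
```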
2018-11-21perf bench: Move HAVE_PTHREAD_ATTR_SETAFFINITY_NP into bench.hDavidlohr Bueso
Both futex and epoll need this call, and it can cause build failures on systems that don't have pthread_attr_setaffinity_np(). Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Reported-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Davidlohr Bueso <dbueso@suse.de> Cc: Jason Baron <jbaron@akamai.com> Link: http://lkml.kernel.org/r/20181109210719.pr7ohayuwqmfp2wl@linux-r8p5 Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2018-07-30tools arch: Update arch/x86/lib/memcpy_64.S copy used in 'perf bench mem memcpy'Arnaldo Carvalho de Melo
To cope with the changes in: 12c89130a56a ("x86/asm/memcpy_mcsafe: Add write-protection-fault handling") 60622d68227d ("x86/asm/memcpy_mcsafe: Return bytes remaining") bd131544aa7e ("x86/asm/memcpy_mcsafe: Add labels for __memcpy_mcsafe() write fault handling") da7bc9c57eb0 ("x86/asm/memcpy_mcsafe: Remove loop unrolling") This needed introducing a file with a copy of the mcsafe_handle_tail() function, that is used in the new memcpy_64.S file, as well as a dummy mcsafe_test.h header. Testing it: $ nm ~/bin/perf | grep mcsafe 0000000000484130 T mcsafe_handle_tail 0000000000484300 T __memcpy_mcsafe $ $ perf bench mem memcpy # Running 'mem/memcpy' benchmark: # function 'default' (Default memcpy() provided by glibc) # Copying 1MB bytes ... 44.389205 GB/sec # function 'x86-64-unrolled' (unrolled memcpy() in arch/x86/lib/memcpy_64.S) # Copying 1MB bytes ... 22.710756 GB/sec # function 'x86-64-movsq' (movsq-based memcpy() in arch/x86/lib/memcpy_64.S) # Copying 1MB bytes ... 42.459239 GB/sec # function 'x86-64-movsb' (movsb-based memcpy() in arch/x86/lib/memcpy_64.S) # Copying 1MB bytes ... 42.459239 GB/sec $ This silences this perf tools build warning: Warning: Kernel ABI header at 'tools/arch/x86/lib/memcpy_64.S' differs from latest version at 'arch/x86/lib/memcpy_64.S' Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Ahern <dsahern@gmail.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Mika Penttilä <mika.penttila@nextfour.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Tony Luck <tony.luck@intel.com> Cc: Wang Nan <wangnan0@huawei.com> Link: https://lkml.kernel.org/n/tip-igdpciheradk3gb3qqal52d0@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2018-06-25perf bench: Fix numa report output codeJiri Olsa
Currently we can hit following assert when running numa bench: $ perf bench numa mem -p 3 -t 1 -P 512 -s 100 -zZ0cm --thp 1 perf: bench/numa.c:1577: __bench_numa: Assertion `!(!(((wait_stat) & 0x7f) == 0))' failed. The assertion is correct, because we hit the SIGFPE in following line: Thread 2.2 "thread 0/0" received signal SIGFPE, Arithmetic exception. [Switching to Thread 0x7fffd28c6700 (LWP 11750)] 0x000.. in worker_thread (__tdata=0x7.. ) at bench/numa.c:1257 1257 td->speed_gbs = bytes_done / (td->runtime_ns / NSEC_PER_SEC) / 1e9; We don't check if the runtime is actually bigger than 1 second, and thus this might end up with zero division within FPU. Adding the check to prevent this. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: David Ahern <dsahern@gmail.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20180620094036.17278-1-jolsa@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
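To make the failure mode concrete: if the runtime is kept in integer nanoseconds, dividing it by an integer NSEC_PER_SEC truncates anything under one second to 0, and the outer division then faults. The sketch below is not the patch itself (the commit adds a check); it shows an alternative that does the conversion in floating point and still guards the zero case. speed_gbs() and NSEC_PER_SEC_F are hypothetical names, and the integer types are an assumption.

```c
#include <stdint.h>

#define NSEC_PER_SEC_F 1000000000.0	/* hypothetical floating-point constant */

/* Hypothetical helper: throughput in GB/sec. Converting to seconds in
 * floating point keeps sub-second runtimes from truncating to zero;
 * a zero runtime is reported as 0 instead of being divided by. */
static double speed_gbs(uint64_t bytes_done, uint64_t runtime_ns)
{
	double secs;

	if (runtime_ns == 0)
		return 0.0;

	secs = (double)runtime_ns / NSEC_PER_SEC_F;
	return (double)bytes_done / secs / 1e9;
}
```

For example, 1,000,000,000 bytes processed in 500,000,000 ns works out to 2 GB/sec, a case the truncating integer form would turn into a division by zero.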
2018-05-07perf bench numa: Fix typo in optionsYisheng Xie
'R' means access the data via reads instead of writes, fix this typo. Signed-off-by: Yisheng Xie <xieyisheng1@huawei.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1524644707-11030-1-git-send-email-xieyisheng1@huawei.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>