CSIT-2001 update: Xeon Skylake Performance and Progressions/Regressions RCAs


Maciek Konstantynowicz (mkonstan)
 

On 12 May 2020, at 15:18, Maciek Konstantynowicz (mkonstan) <mkonstan@...> wrote:

Dear All,

We have finally pushed out an update to the CSIT-2001 report with VPP
performance data for testbeds with Intel Xeon Skylake processors (2n-skx
and 3n-skx testbeds), whose SUT and TG servers were impacted by firmware
and OS upgrades (BIOS, ucode, and kernel updates with mitigations against
the newly discovered Spectre-Meltdown security vulnerabilities).

The updated CSIT-2001 report should be available for browsing just
before 15:00 UTC today, subject to Jenkins job execution (the report
will carry an updated version timestamp):

https://docs.fd.io/csit/rls2001/report/
https://docs.fd.io/csit/rls2001/report/vpp_performance_tests/csit_release_notes.html

In addition to 2n-skx and 3n-skx performance data available at the usual
locations in the report (see links [r1] to [r4] referenced below), we
have expanded the way we do VPP release-to-release comparisons and root
cause analysis (RCA) for any identified performance progressions and
regressions:

- CSIT test environment is now versioned, with ver. 1 associated
with CSIT rls1908 git branch as of 2019-08-21, and ver. 2
associated with CSIT master and rls2001 git branches as of
2020-03-27.

- To identify SUT performance change(s) due to CSIT test environment
change(s) from ver. 1 to ver. 2, VPP v19.08.1 has been re-tested
in ver. 2 and results compared against the past data obtained with
ver. 1. RCA1 analysis has been applied to this part. See [r5].

- To identify SUT performance change(s) due to VPP code change(s)
from v19.08.1 to v20.01.0, both VPP versions have been tested in
CSIT environment ver. 2 and results compared. Separate RCA2
analysis has been applied to this part. See [r5].

- At this stage, RCA1 and RCA2 analyses focus on performance changes
above +5% (progressions) and below -5% (regressions).
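
The comparison step described above can be sketched in a few lines of
Python. This is only an illustration, not the code CSIT uses to build
the report: the function names, the test names and the throughput
numbers are all made up, and the only value taken from this message is
the 5% threshold.

```python
# Illustrative sketch only; names and numbers are hypothetical.
# The 5% threshold is the one quoted in this message.
THRESHOLD_PCT = 5.0


def relative_change_pct(reference, current):
    """Relative change of `current` against `reference`, in percent."""
    return (current - reference) / reference * 100.0


def classify(change_pct, threshold=THRESHOLD_PCT):
    """Label a change as progression, regression or within threshold."""
    if change_pct > threshold:
        return "progression"
    if change_pct < -threshold:
        return "regression"
    return "within-threshold"


def compare(reference_results, current_results):
    """Compare two {test_name: throughput} mappings on their common tests."""
    rows = {}
    for test in sorted(reference_results.keys() & current_results.keys()):
        pct = relative_change_pct(reference_results[test], current_results[test])
        rows[test] = (round(pct, 2), classify(pct))
    return rows


# Hypothetical PDR throughput values (Mpps) for two runs of the same tests.
reference = {"test-a": 10.0, "test-b": 20.0, "test-c": 10.0}
current = {"test-a": 11.0, "test-b": 18.0, "test-c": 10.2}

for name, (pct, label) in compare(reference, current).items():
    print(f"{name}: {pct:+.2f}% {label}")
```

For RCA1 the two inputs would be the same VPP version measured in
environment ver. 1 and ver. 2; for RCA2 they would be the two VPP
versions measured in environment ver. 2.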

The complete list of RCAs identified as part of this exercise, [1] to
[12], is pasted below.

Hope it makes sense. For any questions and comments please contact
csit-dev@....

Regards,
Maciek
(on behalf of FD.io CSIT team)


Specific links within the report:

[r1] VPP throughput graphs,
https://docs.fd.io/csit/rls2001/report/vpp_performance_tests/packet_throughput_graphs/index.html

[r2] VPP throughput speedup multi-core,
https://docs.fd.io/csit/rls2001/report/vpp_performance_tests/throughput_speedup_multi_core/ip4-2n-skx-xxv710.html

[r3] VPP packet latency,
https://docs.fd.io/csit/rls2001/report/vpp_performance_tests/packet_latency/index.html

[r4] VPP soak tests,
https://docs.fd.io/csit/rls2001/report/vpp_performance_tests/soak_tests/index.html

[r5] 2n-skx PDR comparison with RCA,
https://docs.fd.io/csit/rls2001/report/vpp_performance_tests/comparisons/current_vs_previous_release.html#n-skx

[r6] 3n-skx PDR comparison with RCA,
https://docs.fd.io/csit/rls2001/report/vpp_performance_tests/comparisons/current_vs_previous_release.html#id1

RCA1:

[1] DONE, Impact of upgrades: i) Skx ucode from 0x2000043 to 0x2000065,
ii) Linux kernel from 4.15.0-60 to 4.15.0-72 and iii) SuperMicro
motherboard BIOS from 3.0c to 3.2.

[2] DONE, Applied fix of FVL NIC firmware 6.0.1 for increasing TRex pps
rate from 27 Mpps to 37 Mpps, [CSIT-1503], [TRex-519].

[3] DONE, Applied VPP PAPI fix to enable memif zero-copy, [CSIT-1592],
[VPP-1764].

[4] OPEN, Higher than before StDev of PDR throughput for VPP vhost-user
with VPP-inside-VM, under investigation, [CSIT-1699], [CSIT-1704].

RCA2:

[5] OPEN, dot1q-l2xcbase progression, retro-inspection of weekly ndrpdr
tests points to ge-22805, automated bisect script does not work
due to frequent API changes, [CSIT-1699], [CSIT-1705].

[6] DONE, ip4base-nat44 regression, ge-23963
(https://gerrit.fd.io/r/c/vpp/+/23963#message-044278e6_752c3327).

[7] WIP, avf-ip4scale regression, CANDIDATE(S) before ge-22699,
[CSIT-1699], [CSIT-1706].

[8] OPEN, VPP vhost-user with VPP-inside-VM higher than before stdev
of PDR throughput, under investigation, [CSIT-1699], [CSIT-1704].

[9] WIP, vhost-user with testpmd-in-VM progression, CANDIDATE(S)
before 22277, [CSIT-1699], [CSIT-1707].

[10] WIP, avf-ip4base regression, CANDIDATE(S) range
ge-18361..ge-24505, [CSIT-1699], [CSIT-1708].

[11] DONE, memif regression, CANDIDATE(S) confirmed ge-23801.

[12] WIP, ipsec tnl sw scale regression, CANDIDATE(S) before ge-23557,
[CSIT-1699], [CSIT-1712].
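
Several of the items above narrow a change down to a candidate range or
mention bisecting. As a rough illustration of that idea, and not the
CSIT bisect script itself, the sketch below does a plain binary search
over an ordered list of builds. It assumes the list is in merge order,
that the first entry is known good and the last known bad, and that
performance flips only once in between. The `is_bad` callback and the
build labels are hypothetical stand-ins for "build this VPP change, run
the test, compare against the threshold".

```python
# Illustrative sketch only; not the CSIT bisect script.
def first_bad(candidates, is_bad):
    """Return the first candidate for which is_bad() is true.

    Assumes candidates[0] is good, candidates[-1] is bad, and the
    good/bad verdict flips exactly once along the ordered list.
    """
    lo, hi = 0, len(candidates) - 1
    # Invariant: candidates[lo] is good, candidates[hi] is bad.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(candidates[mid]):
            hi = mid
        else:
            lo = mid
    return candidates[hi]


# Hypothetical ordered builds; pretend the change lands in "c5".
builds = [f"c{i}" for i in range(8)]
bad_builds = {"c5", "c6", "c7"}

print(first_bad(builds, lambda build: build in bad_builds))
```

Each call to `is_bad` stands for a full build-and-measure cycle, so the
number of test runs grows with the logarithm of the range size rather
than with the range size itself.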

