CSIT failing perf tests for week 05 (2023-01-30 – 2023-02-06)


Vratko Polak -X (vrpolak - PANTHEON TECHNOLOGIES at Cisco)
 

Hi. I am taking over from Viliam.

 

Running out of time this week, so some items are not updated.

Failures as of trending on Monday, fixes as of master on Thursday.

 

After the summary table, I just copied the data from
https://wiki.fd.io/view/CSIT/TestFailuresTracking

Unlike the summary, that document waits for trending confirmation before declaring an issue as Fixed.

 

Note: QUIC tests were not part of Trending that week.

 

Vratko.

 

=====SUMMARY=====

New Unfixed Issues: 1

New (but already) Fixed Issues: 1

Old Unfixed Issues: 13

Old Fixed Issues: 0

Total Unfixed Issues: 14

 

 

= Current Failures =

 

== Deterministic Failures ==

 

=== In Trending ===

 

==== (M) csit-dpdk-perf-mrr-weekly-master-3n-snr fails due to a missing symlink ====

 

* last update: 2023-02-09

* work-to-fix: easy

* rca: Missing file in CSIT git (probably an oversight).

* test: all (robot does not even start)

* testbed: 3n-snr

* frequency: always

* example: https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-mrr-daily-master-3n-alt/155/log.html.gz#s1-s1-s1-s1-s1-t1

* ticket: [https://jira.fd.io/browse/CSIT-1894 CSIT-1894]

* gerrit: https://gerrit.fd.io/r/c/csit/+/38197

* note: Waiting for the weekly run to confirm this was fixed by 38197.

 

==== (M) wrong MAC address on lf_2n_clx_testbed27.yaml ====

 

* last update: 2023-02-09

* work-to-fix: easy

* rca: typo in topology yaml file

* test: mlx5 relying on MAC. Affected: memif, vhost, l2bd. Not affected: ip4, ip6, dot1q, other L2.

* testbed: 2n-clx, only the first testbed out of three in lab

* frequency: always, unless other 2n-clx testbed is reserved

* example: https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-mrr-daily-master-2n-clx/1270/log.html.gz#s1-s1-s1-s1-s3-t1-k2-k9-k1-k1-k1-k21

* ticket: [https://jira.fd.io/browse/CSIT-1893 CSIT-1893]

* note: Peter Mikus will look at this.

 

==== (M) 3n-snr: All hwasync wireguard tests failing when trying to verify device ====

 

* last update: before 2023-01-31

* work-to-fix: hard

* rca: Missing QAT driver. Symptom: Failed to bind PCI device 0000:f4:00.0 to c4xxx on host 10.30.51.93

* test: hwasync wireguard

* frequency: always

* testbed: 3n-snr

* example: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-mrr-daily-master-3n-snr/95/log.html.gz#s1-s1-s1-s3-s1 3n-snr]

* ticket: [https://jira.fd.io/browse/CSIT-1883 CSIT-1883]

 

==== (M) 1n-aws: TRex mlrsearch fails to find NDR & PDR due to AWS rate limiting (5min total test duration) ====

 

* last update: 2023-02-09

* work-to-fix: hard

* rca:

* test: ip4scale2m

* frequency: always

* testbed: 1n-aws

* example: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-trex-perf-ndrpdr-weekly-master-1n-aws/18/log.html.gz#s1-s1-s1-s1-s2-t1 1n-aws]

* ticket: [https://jira.fd.io/browse/CSIT-1876 CSIT-1876]

* note: The root cause may be the shared environment in the AWS cloud. We may need to use a smaller scale there.

 

==== (M) 3n-alt, 3n-snr: testpmd no traffic forwarded ====

 

* last update: 2023-02-09

* work-to-fix: medium

* rca: DUT-DUT link takes too long to come up on some testbeds. This happens *after* a test case with a DPDK app (not VPP, even when it uses the dpdk plugin), although multiple subsequent tests (even with VPP) may be affected. The real cause is probably in NIC firmware or driver, but as a workaround CSIT can get better at detecting port status.

* test: testpmd (also l3fwd but hidden by CSIT-1896)

* frequency: always (almost)

* testbed: 3n-alt, 3n-snr

* example: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-dpdk-perf-mrr-weekly-master-3n-alt/42/log.html.gz#s1-s1-s1-s1-t1 3n-alt], [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-dpdk-perf-report-iterative-2210-3n-snr/6/log.html.gz#s1-s1-s1-s1-t1 3n-snr], [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-dpdk-perf-report-iterative-2210-3n-snr/14/log.html.gz#s1-s1-s1-s1-t1 3n-snr]

* ticket: [https://jira.fd.io/browse/CSIT-1848 CSIT-1848]
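The port status workaround mentioned in the rca could take the form of a bounded polling loop. The following is only a minimal sketch, not CSIT code: the function name, the parameters and the `is_link_up` probe are all hypothetical, and how the probe actually queries the NIC is left open.

```python
import time
from typing import Callable


def wait_for_link_up(
    is_link_up: Callable[[], bool],
    timeout_s: float = 60.0,
    interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll a (hypothetical) link probe until it reports up or time runs out.

    Returns True as soon as the probe reports the link as up,
    False if the timeout expires first.
    """
    deadline = clock() + timeout_s
    while True:
        if is_link_up():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

A test could call this before sending traffic and fail early with a clear "link down" message instead of reporting zero forwarded packets. The `sleep` and `clock` arguments exist only so the loop can be exercised without real waiting.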

 

==== (M) 3n-alt: Tests failing until 40Ge Interface comes up ====

 

* last update: 2023-02-09

* work-to-fix: medium

* rca: DUT-DUT link takes too long to come up due to CSIT-1848.

* test: first tests in order

* frequency: always (almost, depends on run order)

* testbed: 3n-alt (3n-snr link does not take that long)

* example: https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-mrr-daily-master-3n-alt/155/log.html.gz#s1-s1-s1-s1-s1-t1

* ticket: [https://jira.fd.io/browse/CSIT-1890 CSIT-1890]

 

=== Not In Trending ===

 

==== (H) 3n-icx: vpp hoststack QUIC vppecho tests failing ====

 

* last update: 2023-02-09

* work-to-fix: easy

* rca:

* test: Quic vppecho BPS

* frequency: always

* testbed: 3n-skx, 3n-icx

* example: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-iterative-2210-3n-icx/17/log.html.gz#s1-s1-s1-s1-s5-t1 3n-icx]

* ticket: [https://jira.fd.io/browse/CSIT-1835 CSIT-1835]

* gerrit: https://gerrit.fd.io/r/c/csit/+/38085

* note: Will be part of daily hoststack job, waiting to see it there to confirm 38085 is the fix.

 

==== (M) all testbeds: some vpp 9000B tests ====

 

* last update: 2023-02-09

* work-to-fix: hard

* rca: VPP code: [https://gerrit.fd.io/r/c/vpp/+/34839 34839: dpdk: cleanup MTU handling]. CSIT needs to rework how it sets MTU / max frame size (CSIT-1797). Some tests will continue failing due to missing support on the VPP side; we will open specific Jira tickets for those.

* test: see sub-items

* frequency: always

* testbed: all

* examples: see sub-items

* ticket: [https://jira.fd.io/browse/CSIT-1809 CSIT-1809]

* gerrit: https://gerrit.fd.io/r/c/csit/+/37824

 

===== (M) tests with 9000B payload frames not forwarded over vhost interfaces =====

 

* last update: 2023-02-09

* work-to-fix: hard

* test: 9000B + vhostuser

* testbed: 2n-skx, 3n-skx, 2n-clx

* examples: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-coverage-2202-3n-skx/67/log.html.gz#s1-s1-s1-s1-s1 3n-skx vhostuser]

* ticket: [https://jira.fd.io/browse/CSIT-1809 CSIT-1809]

 

===== tests with 9000B payload frames not forwarded over memif interfaces =====

 

* last update: 2023-02-09

* work-to-fix: hard

* test: 9000B + memif

* testbed: 2n-skx, 3n-skx, 2n-clx

* examples: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-coverage-2202-2n-skx/33/log.html.gz#s1-s1-s1-s1-s1 2n-skx Memif]

* ticket: [https://jira.fd.io/browse/CSIT-1808 CSIT-1808]

 

===== 9000B payload frames not forwarded over tunnels due to violating supported Max Frame Size (VxLAN, LISP, SRv6) =====

 

* last update: 2023-02-09

* work-to-fix: medium

* test: 9000B + (IP4 tunnels VXLAN, IP4 tunnels LISP, SRv6, IPsec)

* testbed: 2n-icx, 3n-icx

* examples: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-coverage-2210-2n-icx/10/log.html.gz 2n-icx VXLAN], [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-coverage-2210-3n-icx/22/log.html.gz#s1-s1-s1-s1-s1-t6 3n-icx]

* ticket: [https://jira.fd.io/browse/CSIT-1801 CSIT-1801]
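The arithmetic behind this failure mode can be sketched as follows, using VXLAN over IPv4 as the example. The header sizes are the standard ones; everything else is an assumption for illustration: that 9000B denotes the inner L2 frame size, that FCS and VLAN tags are ignored, and that the limit passed to `fits` is a made-up value, not the limit actually supported on the testbeds.

```python
# Standard header sizes in bytes for VXLAN over IPv4.
OUTER_ETHERNET = 14
OUTER_IPV4 = 20
OUTER_UDP = 8
VXLAN_HEADER = 8
VXLAN_IP4_OVERHEAD = OUTER_ETHERNET + OUTER_IPV4 + OUTER_UDP + VXLAN_HEADER


def encapsulated_frame_size(inner_frame_size: int, overhead: int) -> int:
    """Size of the outer frame after the tunnel headers are prepended."""
    return inner_frame_size + overhead


def fits(inner_frame_size: int, overhead: int, max_frame_size: int) -> bool:
    """True if the encapsulated frame stays within the given limit."""
    return encapsulated_frame_size(inner_frame_size, overhead) <= max_frame_size


# Illustrative limit only (assumption, not a measured testbed value).
ASSUMED_MAX_FRAME_SIZE = 9000
print(encapsulated_frame_size(9000, VXLAN_IP4_OVERHEAD))
print(fits(9000, VXLAN_IP4_OVERHEAD, ASSUMED_MAX_FRAME_SIZE))
```

Under these assumptions a 9000B inner frame grows by 50 bytes once encapsulated, so it exceeds any limit that was sized for the inner frame alone. Other tunnel types add different overheads, but the check has the same shape.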

 

===== (M) 9000B: all AVF tests are failing to forward traffic =====

 

* last update: 2023-02-09

* work-to-fix: hard

* test: 9000B + AVF

* testbed: 3n-icx

* examples: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-coverage-2210-3n-icx/13/log.html.gz#s1-s1-s1-s1-s1-t6 3n-icx ip4base]

* ticket: [https://jira.fd.io/browse/CSIT-1885 CSIT-1885]

 

==== (M) 2n-clx, 2n-icx, 2n-zn2: DPDK testpmd 9000B tests on xxv710 nic are failing with no traffic ====

 

* last update: 2023-02-09

* work-to-fix: medium

* rca: The DPDK app attempts to set MTU only once, and that attempt fails if the interface is down (CSIT-1848). As a workaround, MTU could be set on the Linux interface before starting the DPDK app.

* test: DPDK testpmd 9000B

* frequency: always

* testbed: 2n-clx, 2n-icx, 2n-zn2

* example: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-dpdk-perf-report-iterative-2210-2n-clx/1/log.html.gz#s1-s1-s1-s3-t6 2n-clx], [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-dpdk-perf-report-iterative-2210-2n-icx/3/log.html.gz#s1-s1-s1-s1-t6 2n-icx]

* ticket: [https://jira.fd.io/browse/CSIT-1870 CSIT-1870]

* note: Vratko will fix, either in general workaround for CSIT-1848 or in a separate change.
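The workaround from the rca (set MTU on the Linux interface before the DPDK app starts) might look roughly like this. It is a sketch under assumptions, not the planned CSIT change: the helper names are hypothetical, the interface name and MTU value in the usage comment are placeholders, and it presumes the port is still visible as a Linux interface at that point. The `ip link set dev ... mtu ...` command itself is standard iproute2.

```python
import subprocess
from typing import Callable, List


def build_set_mtu_command(ifname: str, mtu: int) -> List[str]:
    """Build the iproute2 command that sets MTU on a Linux interface."""
    if not ifname:
        raise ValueError("interface name must not be empty")
    if mtu <= 0:
        raise ValueError("MTU must be a positive integer")
    return ["ip", "link", "set", "dev", ifname, "mtu", str(mtu)]


def set_mtu(ifname: str, mtu: int, run: Callable = subprocess.run) -> None:
    """Run the command; raises CalledProcessError if 'ip' reports failure."""
    run(build_set_mtu_command(ifname, mtu), check=True)


# Hypothetical usage (needs root and a real interface name):
# set_mtu("enp24s0f0", 9000)
```

Separating the command builder from the runner keeps the sketch testable without root access; the `run` argument can be replaced with a stub or with whatever remote execution helper the test framework provides.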

 

==== (M) 2n-clx, 2n-icx: all Geneve tests with 1024 tunnels fail ====

 

* last update: before 2023-01-31

* work-to-fix: hard

* rca: VPP crash, Failed to add IP neighbor on interface geneve_tunnel258

* test: avf-ethip4--ethip4udpgeneve-1024tun-ip4base 64B 1518B IMIX 1c 2c 4c

* frequency: always

* testbed: 2n-skx, 2n-clx, 2n-icx

* example: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-coverage-2210-2n-icx/10/log.html.gz#s1-s1-s1-s1-s1 2n-icx]

* ticket: [https://jira.fd.io/browse/CSIT-1800 CSIT-1800]

 

==== (L) 2n-clx, 2n-icx: nat44ed cps 16M sessions scale fail ====

 

* last update: before 2023-01-31

* work-to-fix: hard

* rca: VPP crash, Failed to set NAT44 address range on host 10.30.51.44 (connections-per-second tests only)

* test: 64B-avf-ethip4tcp-nat44ed-h262144-p63-s16515072-cps-ndrpdr 1c 2c 4c, 64B-avf-ethip4udp-nat44ed-h262144-p63-s16515072-cps-ndrpdr 1c 2c 4c

* frequency: always

* testbeds: 2n-skx, 2n-clx, 2n-icx

* example: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-coverage-2210-2n-icx/18/log.html.gz#s1-s1-s1-s1-s11-t3 2n-icx], [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-coverage-2210-2n-clx/9/log.html.gz#s1-s1-s1-s1-s11-t1 2n-clx]

* ticket: [https://jira.fd.io/browse/CSIT-1799 CSIT-1799]

 

==== (L) 2n-clx, 2n-icx: nat44det imix 1M sessions fails to create sessions ====

 

* last update: before 2023-01-31

* work-to-fix: hard

* rca:

* test: IMIX over 1M sessions bidir

* frequency: always

* testbed: 2n-skx, 2n-clx, 2n-icx

* example: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-coverage-2210-2n-icx/18/log.html.gz#s1-s1-s1-s1-s2-t4 2n-icx]

* ticket: [https://jira.fd.io/browse/CSIT-1884 CSIT-1884]

 

== Occasional Failures ==

 

=== In Trending ===

 

==== (H) 2n-icx: NFV density VPP does not start in container ====

 

* last update: before 2023-01-31

* work-to-fix: hard

* rca:

* test: all subsequent

* frequency: medium

* testbed: 2n-icx

* example: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-mrr-weekly-master-2n-icx/57/log.html.gz 2n-icx mrr], [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-ndrpdr-weekly-master-2n-icx/48/log.html.gz#s1-s1-s1-s5-s8-t1 2n-icx ndrpdr]

* ticket: [https://jira.fd.io/browse/CSIT-1881 CSIT-1881]

* note: Once VPP breaks, all subsequent tests fail. All subsequent builds will also keep failing until Peter gets the TB working again. Although the failure occurs with only medium frequency, when it happens it breaks all subsequent builds on the TB, hence the [H] priority.

 

==== (M) 2n-clx: e810 mlrsearch tests packets forwarding in one direction ====

 

* last update: before 2023-01-31

* work-to-fix: hard

* rca:

* test: e810Cq ip4base, ip6base

* frequency: high

* testbed: 2n-clx

* example: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-ndrpdr-weekly-master-2n-clx/176/log.html.gz#s1-s1-s1-s2-s8-t1 2n-clx]

* ticket: [https://jira.fd.io/browse/CSIT-1864 CSIT-1864]

 

==== (M) 3n-icx, 3n-snr: wireguard 100 and 1000 tunnels mlrsearch tests failing with 2c and 4c ====

 

* last update: before 2023-01-31

* work-to-fix: easy

* rca:

* test: wireguard 100 tunnels and more

* frequency: high

* testbed: 3n-icx, 3n-snr

* examples: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-ndrpdr-weekly-master-3n-icx/56/log.html.gz#s1-s1-s1-s3-s8-t4 3n-icx]

* ticket: [https://jira.fd.io/browse/CSIT-1886 CSIT-1886]

 

==== (M) 3n-tsh: vpp in VM not starting ====

 

* last update: before 2023-01-31

* work-to-fix: easy

* rca:

* test: 3n-tsh: sporadic VM vhost

* frequency: high

* testbed: 3n-tsh

* example: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-mrr-daily-master-3n-tsh/738/log.html.gz#s1-s1-s1-s7-s2-t1 3n-tsh], [https://jenkins.fd.io/view/csit/job/csit-vpp-perf-verify-master-3n-tsh/123/ 3n-tsh]

* ticket: [https://jira.fd.io/browse/CSIT-1877 CSIT-1877]

* note: 3n-alt testbed was fixed (fix: rebuilt initrd .37 on TB). 3n-tsh is still failing.

 

== Rare Failures ==

 

=== In Trending ===

 

==== (M) 3n-icx, 3n-snr: 1518B IPsec packets not passing ====

 

* last update: before 2023-01-31

* work-to-fix: hard

* rca:

* test: all AVF crypto

* frequency: low

* testbed: 3n-skx, 3n-icx, 3n-snr

* example: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-mrr-daily-master-3n-icx/197/log.html.gz#s1-s1-s1-s1-s4-t1 3n-icx daily], [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-mrr-daily-master-3n-snr/32/log.html.gz#s1-s1-s1-s1-s4-t1 3n-snr], [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-ndrpdr-weekly-master-3n-icx/57/log.html.gz#s1-s1-s1-s1-s4-t1 3n-icx weekly]

* ticket: [https://jira.fd.io/browse/CSIT-1827 CSIT-1827]

 

==== (M) all testbeds: mlrsearch fails to find NDR rate ====

 

* last update: before 2023-01-31

* work-to-fix: hard

* rca:

* test: Crypto, Ip4, L2, Srv6, Vm Vhost (all packet sizes, all core configurations affected)

* frequency: low

* testbed: 3n-tsh, 3n-alt, 2n-clx

* example: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-ndrpdr-weekly-master-2n-icx/57/log.html.gz#s1-s1-s1-s2-s37-t2 2n-icx]

* ticket: [https://jira.fd.io/browse/CSIT-1804 CSIT-1804]

 

==== (M) all testbeds: AF_XDP mlrsearch fails to find NDR rate ====

 

* last update: before 2023-01-31

* work-to-fix: hard

* rca:

* test: af-xdp multicore tests

* frequency: low

* testbed: 2n-clx, 2n-skx, 2n-tx2, 2n-icx

* example: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-ndrpdr-weekly-master-2n-skx/202/log.html.gz#s1-s1-s1-s2-s4-t3 2n-skx], [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-ndrpdr-weekly-master-2n-clx/152/log.html.gz#s1-s1-s1-s5-s12-t3 2n-clx]

* ticket: [https://jira.fd.io/browse/CSIT-1802 CSIT-1802]

* note: This is mainly observed in iterative and coverage runs. The frequency is very low, about 1 in 100.

 

==== (L) all testbeds: vpp create avf interface failure in multi-core configs ====

 

* last update: 2023-02-06

* work-to-fix: hard

* rca: issue in Intel FVL driver

* test: multicore AVF

* frequency: low

* testbed: all testbeds

* example: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-mrr-daily-master-2n-clx/1257/log.html.gz#s1-s1-s1-s5-s24-t2 2n-clx], [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-mrr-daily-master-3n-icx/197/log.html.gz#s1-s1-s1-s5-s1-t3 3n-icx]

* ticket: [https://jira.fd.io/browse/CSIT-1782 CSIT-1782]

* note: A long standing issue without a final permanent fix.

 

==== (L) all testbeds: nat44det 4M and 16M scale 1 session not established ====

 

* last update: 2023-02-06

* work-to-fix: hard

* rca: unknown

* test: nat44det udp 4m and 16m (64k and 1m are ok)

* frequency: low

* testbed: 2n-zn2, 2n-skx, 2n-icx, 2n-clx

* example: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-mrr-daily-master-2n-zn2/672/log.html.gz#s1-s1-s1-s2-s22-t3 2n-zn2], [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-ndrpdr-weekly-master-2n-clx/164/log.html.gz#s1-s1-s1-s2-s54-t1 2n-clx]

* ticket: [https://jira.fd.io/browse/CSIT-1795 CSIT-1795]

 

= Past Failures =
