Docker Hub Rate Limit issue RESOLVED (was: [vpp-dev] Jenkins jobs UNSTABLE due to failure to upload logs to nexus.fd.io)


Dave Wallace
 

Folks,

The Docker Hub Rate Limit issue has been resolved (details below) and the FD.io CI jobs are operating normally.  Please let me know if you encounter any errant failure signatures.

Thanks to Vanessa & Trishan for their help resolving the outage.

Cheers,
-daw-

---- %< ----
Docker Hub Rate Limit Resolution:

It turns out I misunderstood Docker Hub's rate limiting scheme -- the limit is imposed on the client issuing the pull request (anonymous, or a docker id that is not authorized on the repository), not on the repository accounts. Therefore, to avoid the rate limit, we needed to create an authenticated account, add it to the 'fdiotools' 'users' team, and then configure Nomad to log in with that docker id for pull requests from the 'fdiotools' repositories.

Vanessa & Trishan created an fd.io email account and a docker account, which were then added to the jenkins.fd.io Nomad Plugin configuration templates for all FD.io projects.  Nomad is now successfully issuing docker pull requests and spinning up CI job executors at the request of jenkins.fd.io!
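
For anyone who wants to check which limit a given client is subject to, here is a minimal Python sketch. It is not what was deployed for FD.io. It assumes Docker Hub's documented 'ratelimitpreview/test' repository and the 'ratelimit-limit' / 'ratelimit-remaining' response headers; the function names and the username/password parameters are placeholders of mine, not the credentials or tooling used in the Nomad templates.

```python
import base64
import json
import urllib.request

# Endpoints as described in Docker Hub's rate limit documentation (assumed
# unchanged); 'ratelimitpreview/test' is Docker's public test repository.
TOKEN_URL = ("https://auth.docker.io/token"
             "?service=registry.docker.io"
             "&scope=repository:ratelimitpreview/test:pull")
MANIFEST_URL = ("https://registry-1.docker.io/v2/"
                "ratelimitpreview/test/manifests/latest")


def parse_ratelimit(value):
    """Parse a header value such as '100;w=21600' into (count, window_seconds)."""
    count, _, rest = value.partition(";")
    window = None
    for param in rest.split(";"):
        key, _, val = param.strip().partition("=")
        if key == "w" and val:
            window = int(val)
    return int(count), window


def fetch_limits(username=None, password=None):
    """Return the rate limit headers seen by an anonymous or logged-in client.

    username/password are hypothetical placeholders; omit them to query as
    an anonymous client.
    """
    token_req = urllib.request.Request(TOKEN_URL)
    if username and password:
        cred = base64.b64encode(f"{username}:{password}".encode()).decode()
        token_req.add_header("Authorization", "Basic " + cred)
    with urllib.request.urlopen(token_req) as resp:
        token = json.load(resp)["token"]

    # A HEAD request reads the headers without downloading the manifest.
    head = urllib.request.Request(MANIFEST_URL, method="HEAD")
    head.add_header("Authorization", "Bearer " + token)
    with urllib.request.urlopen(head) as resp:
        return {
            name: parse_ratelimit(resp.headers[name])
            for name in ("ratelimit-limit", "ratelimit-remaining")
            if resp.headers[name] is not None
        }


if __name__ == "__main__":
    print("anonymous:", fetch_limits())
```

Calling fetch_limits() with and without credentials from the same host should show whether the pulls are being counted against the anonymous limit or against the docker id.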

Life is good :)
---- %< ----

On 11/18/2020 6:43 PM, Dave Wallace via lists.fd.io wrote:
Folks,

IT-21051 was resolved by Vanessa's ci-management patch [0] while, [nearly] simultaneously, two patches [1] [2] from Andrew Y were deployed which removed the artifact publishing from the VPP CI jobs.  Those two changes were subsequently reverted [3].

Operation of VPP CI jobs has been restored and I have done a 'recheck' on all gerrit changes which previously failed due to the UNSTABLE job completion status.

Unfortunately, there is a new issue caused by hitting the Docker Hub Pull limit [4] which is causing job allocations to fail and the jenkins build queue to back up.  I have opened a new LF Help Desk Ticket [4], sent an email to the TSC, and will bring this up in tomorrow's TSC meeting to get it resolved.

There also appears to be a similar issue with the vpp-csit-verify-device-master-1n-skx job, whose runs are failing because containers cannot be started.

Thank you for your patience during this outage and thanks to Vanessa & the entire LF-IT team who worked on identifying the fix to the log upload issue.  Also a big thank you to Andrew Yourtchenko for his assistance in pushing ci-management patches and Vratko for ci-management patch reviews.

-daw-

[0] https://gerrit.fd.io/r/c/ci-management/+/29986
[1] https://gerrit.fd.io/r/c/ci-management/+/29985
[2] https://gerrit.fd.io/r/c/ci-management/+/29987
[3] https://gerrit.fd.io/r/c/ci-management/+/29988
[4] https://jira.linuxfoundation.org/plugins/servlet/theme/portal/2/IT-21063

On 11/17/2020 12:38 PM, Dave Wallace via lists.fd.io wrote:
Folks,

There is an issue with CI jobs being marked as UNSTABLE due to a failure to upload log files to nexus.fd.io.  This is stalling the CI job pipeline because the checkstyle job is not succeeding.

I have opened a case with LF-IT: https://jira.linuxfoundation.org/plugins/servlet/theme/portal/2/IT-21051

Thanks,
-daw-