[BUG] - Existing certificate secret name is not being picked up.

Describe the bug

When using an existing certificate (type: existing) and providing a secret name in the nebari config file, the secret name isn’t passed into the rendered Terraform files.

Expected behavior

We should be able to use an existing secret for our TLS certificate.

OS and architecture in which you are running Nebari

AWS, EKS

How to Reproduce the problem?

Add the following block to the nebari config file, then run nebari deploy -c nebari-config.yaml --dns-provider cloudflare --dns-auto-provision:

certificate:
  type: existing
  secret_name: my-tls-certificate-secret

Command output

[terraform]:   # module.kubernetes-ingress.kubernetes_manifest.tlsstore_default[0] will be updated in-place
[terraform]:   ~ resource "kubernetes_manifest" "tlsstore_default" {
[terraform]:       ~ object   = {
[terraform]:           ~ spec       = {
[terraform]:               ~ defaultCertificate = {
[terraform]:                   ~ secretName = "my-tls-certificate-secret" -> ""
[terraform]:                 }
[terraform]:             }
[terraform]:             # (3 unchanged elements hidden)
[terraform]:         }
[terraform]:         # (1 unchanged attribute hidden)
[terraform]:     }
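
The plan output above shows secretName going from a set value to an empty string. As a rough way to catch this kind of regression before applying, the plan text could be scanned for attributes that are being reset to "". This is only a sketch: the helper name attributes_reset_to_empty is hypothetical, and the regular expression assumes the plan line format shown above.

```python
import re

# Matches in-place update lines such as:
#   ~ secretName = "my-tls-certificate-secret" -> ""
# The pattern is an assumption based on the plan output quoted in this issue.
RESET_TO_EMPTY = re.compile(
    r'~\s+(?P<attr>[\w-]+)\s+=\s+"(?P<old>[^"]+)"\s+->\s+""\s*$'
)


def attributes_reset_to_empty(plan_text: str) -> list[tuple[str, str]]:
    """Return (attribute, old_value) pairs that the plan would blank out."""
    hits = []
    for line in plan_text.splitlines():
        match = RESET_TO_EMPTY.search(line)
        if match:
            hits.append((match.group("attr"), match.group("old")))
    return hits
```

Any non-empty result would be a signal to stop and inspect the plan before running an apply.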

Versions and dependencies used.

No response

Compute environment

AWS

Integrations

No response

Anything else?

Looking through past changes, the issue appears to have originated from this change: https://github.com/nebari-dev/nebari/pull/1421/files#diff-a77d45450ea0a454e8b346914975631f93dc83a04d3886ec9e99977703a93164L191

Top GitHub Comments

sblair-metrostar commented, Dec 7, 2022

The secret is there, and everything works as long as the TLSStore resource is updated to look for it. It looks like the secret_name parameter from nebari_config.yaml isn’t being passed through to Terraform, so it’s always empty after an apply.

apiVersion: traefik.containo.us/v1alpha1
kind: TLSStore
metadata:
  creationTimestamp: "2022-12-06T15:34:48Z"
  generation: 2
  name: default
  namespace: onyx
  resourceVersion: "39052"
  uid: 5db2cdbb-a6a8-49b4-a116-c1a7eaa13317
spec:
  defaultCertificate:
    secretName: metrostar-certificate

That secretName is an empty string after Terraform runs, even though it is an input variable that, based on everything I can find in the code and docs and on known working behavior in prior versions of QHub, should work as expected. It simply appears to be getting dropped somewhere in the middle. My best guess is the change made here, which was the last place I found in the history where that variable was explicitly populated. @abilal-mss and I observed the same behavior on a fresh nebari install yesterday, so I don’t think it has anything to do with the environment that experienced all the volume issues. I can’t guess how those could be related, but crazier things have happened.
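
To make the expected behavior concrete, here is a minimal sketch of the kind of mapping that seems to be missing: reading secret_name from the certificate block and handing it to Terraform as the certificate-secret-name input. The function names certificate_secret_name and terraform_inputs are hypothetical and are not Nebari’s actual code; only the config keys and the variable name come from this issue.

```python
from typing import Optional


def certificate_secret_name(config: dict) -> Optional[str]:
    """Return the secret name to pass to Terraform, or None if not applicable."""
    certificate = config.get("certificate") or {}
    if certificate.get("type") != "existing":
        return None
    secret_name = certificate.get("secret_name")
    if not secret_name:
        # Fail loudly instead of silently passing an empty string downstream.
        raise ValueError("certificate.type is 'existing' but secret_name is missing")
    return secret_name


def terraform_inputs(config: dict) -> dict:
    """Build the (hypothetical) input variables for the ingress module."""
    return {"certificate-secret-name": certificate_secret_name(config)}
```

With the config from this issue, terraform_inputs would carry my-tls-certificate-secret through instead of dropping it.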

Also, as a semi-related but much less important potential bug with that TLSStore: there’s a condition on its creation that suggests it’s not meant to exist at all when the certificate-secret-name variable is null. But that variable is defaulted to an empty string further upstream, so I’m not sure it ever is null; at least, I haven’t seen a case where that condition evaluates to false in my testing. Not sure it matters, but just FYI.
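
The null versus empty-string point can be illustrated with a small Python analogue. This is not the actual Terraform condition, just a sketch of the two ways such a guard could be written; both function names are made up for illustration.

```python
from typing import Optional


def count_if_not_null(secret_name: Optional[str]) -> int:
    """Analogue of a 'create only when not null' guard."""
    return 1 if secret_name is not None else 0


def count_if_non_empty(secret_name: Optional[str]) -> int:
    """Analogue of a guard that also treats an empty string as 'not set'."""
    return 1 if secret_name else 0
```

An empty string passes the first guard but not the second, which would explain why the TLSStore is always created when the variable defaults to "" rather than null.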

Regarding the volume mounting issue: for the first day or so, only the conda-store NFS volume was failing to mount, which prevented Jupyter single-user pods and dask workers for any existing user environments from spinning up. It wasn’t failing 100% of the time at first. It started with intermittent failures, then became more consistent throughout the day, to the point that no new pods would schedule. Despite the obvious investigations (checking pod logs for everything that seemed relevant, checking the available storage on the nodes, confirming the conda NFS service was running and accessible, and redeploying pretty much everything in the nebari namespace), I wasn’t having much luck pinning down a cause.

When it reached the point that everything was failing I was mostly in fire-fighting mode and just had to get things back up, which is when I started terminating nodes to get fresh instances. Whether that fixed it or a reboot of the node would have sufficed I can’t say, but I’m not sure there’s really much difference at the end of the day.

pavithraes commented, Dec 13, 2022

Thanks for the additional context, @sblair-metrostar!

@viniciusdc @eskild – @sblair-metrostar’s analysis of the issue and its origination sounds valid to me, what do you think?