
Unable to resolve nameservices for HA HDFS when HDFS is deployed in Kubernetes

See original GitHub issue

Alluxio Version: 2.4.1-1

Describe the bug

1. High-availability HDFS is deployed in Kubernetes with the following configuration files: core-site.txt, hdfs-site.txt

2. Alluxio starts the masters using a ConfigMap (alluxio-configmap.txt) that sets -Dalluxio.master.mount.table.root.ufs=hdfs://hdfs-k8s/alluxio

3. But the Alluxio master fails with java.net.UnknownHostException: hdfs-k8s:

2021-03-08 08:10:20,883 WARN  MetricRegistriesImpl - First MetricRegistry has been created without registering reporters. You may need to call MetricRegistries.global().addReportRegistration(...) before.
2021-03-08 08:10:20,885 INFO  RaftJournalSystem - Performing catchup. Last applied SN: 4. Catchup ID: -3575425480893704492
2021-03-08 08:10:20,886 INFO  RaftServerConfigKeys - raft.server.write.element-limit = 4096 (default)
2021-03-08 08:10:20,887 INFO  RaftServerConfigKeys - raft.server.write.byte-limit = 167772160 (custom)
2021-03-08 08:10:20,890 INFO  RaftJournalSystem - Exception submitting term start entry: java.util.concurrent.ExecutionException: org.apache.ratis.protocol.LeaderNotReadyException: alluxio-master-1_19200@group-ABB3109A44C1 is in LEADER state but not ready yet.
2021-03-08 08:10:20,894 INFO  RaftServerConfigKeys - raft.server.watch.timeout = 10s (default)
2021-03-08 08:10:20,895 INFO  RaftServerConfigKeys - raft.server.watch.timeout.denomination = 1s (default)
2021-03-08 08:10:20,896 INFO  RaftServerConfigKeys - raft.server.watch.element-limit = 65536 (default)
2021-03-08 08:10:20,908 INFO  RaftServerConfigKeys - raft.server.log.appender.snapshot.chunk.size.max = 16MB (=16777216) (default)
2021-03-08 08:10:20,908 INFO  RaftServerConfigKeys - raft.server.log.appender.buffer.byte-limit = 10485760 (custom)
2021-03-08 08:10:20,908 INFO  RaftServerConfigKeys - raft.server.log.appender.buffer.element-limit = 0 (default)
2021-03-08 08:10:20,913 INFO  GrpcConfigKeys - raft.grpc.server.leader.outstanding.appends.max = 128 (default)
2021-03-08 08:10:20,913 INFO  RaftServerConfigKeys - raft.server.rpc.request.timeout = 5000ms (custom)
2021-03-08 08:10:20,914 INFO  RaftServerConfigKeys - raft.server.log.appender.install.snapshot.enabled = false (custom)
2021-03-08 08:10:20,914 INFO  RatisMetrics - Creating Metrics Registry : ratis_grpc.log_appender.alluxio-master-1_19200@group-ABB3109A44C1
2021-03-08 08:10:20,914 WARN  MetricRegistriesImpl - First MetricRegistry has been created without registering reporters. You may need to call MetricRegistries.global().addReportRegistration(...) before.
2021-03-08 08:10:20,919 INFO  RaftServerConfigKeys - raft.server.log.appender.snapshot.chunk.size.max = 16MB (=16777216) (default)
2021-03-08 08:10:20,919 INFO  RaftServerConfigKeys - raft.server.log.appender.buffer.byte-limit = 10485760 (custom)
2021-03-08 08:10:20,919 INFO  RaftServerConfigKeys - raft.server.log.appender.buffer.element-limit = 0 (default)
2021-03-08 08:10:20,919 INFO  GrpcConfigKeys - raft.grpc.server.leader.outstanding.appends.max = 128 (default)
2021-03-08 08:10:20,920 INFO  RaftServerConfigKeys - raft.server.rpc.request.timeout = 5000ms (custom)
2021-03-08 08:10:20,920 INFO  RaftServerConfigKeys - raft.server.log.appender.install.snapshot.enabled = false (custom)
2021-03-08 08:10:20,922 INFO  RoleInfo - alluxio-master-1_19200: start LeaderState
2021-03-08 08:10:20,936 INFO  SegmentedRaftLogWorker - alluxio-master-1_19200@group-ABB3109A44C1-SegmentedRaftLogWorker: Rolling segment log-47_50 to index:50
2021-03-08 08:10:20,946 INFO  SegmentedRaftLogWorker - alluxio-master-1_19200@group-ABB3109A44C1-SegmentedRaftLogWorker: Rolled log segment from /journal/raft/02511d47-d67c-49a3-9011-abb3109a44c1/current/log_inprogress_47 to /journal/raft/02511d47-d67c-49a3-9011-abb3109a44c1/current/log_47-50
2021-03-08 08:10:21,090 INFO  SegmentedRaftLogWorker - alluxio-master-1_19200@group-ABB3109A44C1-SegmentedRaftLogWorker: created new log segment /journal/raft/02511d47-d67c-49a3-9011-abb3109a44c1/current/log_inprogress_51
2021-03-08 08:10:21,890 INFO  RaftJournalSystem - Performing catchup. Last applied SN: 4. Catchup ID: -7761596533413317960
2021-03-08 08:10:41,917 INFO  RaftJournalSystem - Caught up in 21032ms. Last sequence number from previous term: 4.
2021-03-08 08:10:41,923 INFO  AbstractMaster - MetricsMaster: Starting primary master.
2021-03-08 08:10:41,925 INFO  MetricsSystem - Reset all metrics in the metrics system in 1ms
2021-03-08 08:10:41,925 INFO  MetricsStore - Cleared the metrics store and metrics system in 1 ms
2021-03-08 08:10:41,926 INFO  AbstractMaster - BlockMaster: Starting primary master.
2021-03-08 08:10:41,927 INFO  AbstractMaster - FileSystemMaster: Starting primary master.
2021-03-08 08:10:41,928 INFO  DefaultFileSystemMaster - Starting fs master as primary
2021-03-08 08:10:41,948 INFO  AbstractMaster - MetaMaster: Starting primary master.
2021-03-08 08:10:41,971 INFO  DefaultMetaMaster - Detected existing cluster ID 0efda228-6f86-4bb4-b467-3dc68899d970
2021-03-08 08:10:41,998 ERROR HeartbeatThread - Uncaught exception in heartbeat executor, Heartbeat Thread shutting down
com.google.common.util.concurrent.UncheckedExecutionException: java.lang.IllegalArgumentException: java.net.UnknownHostException: hdfs-k8s
	at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2051)
	at com.google.common.cache.LocalCache.get(LocalCache.java:3951)
	at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3974)
	at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4958)
	at alluxio.underfs.hdfs.HdfsUnderFileSystem.getFs(HdfsUnderFileSystem.java:811)
	at alluxio.underfs.hdfs.HdfsUnderFileSystem.getSpace(HdfsUnderFileSystem.java:388)
	at alluxio.underfs.UnderFileSystemWithLogging$26.call(UnderFileSystemWithLogging.java:595)
	at alluxio.underfs.UnderFileSystemWithLogging$26.call(UnderFileSystemWithLogging.java:592)
	at alluxio.underfs.UnderFileSystemWithLogging.call(UnderFileSystemWithLogging.java:1208)
	at alluxio.underfs.UnderFileSystemWithLogging.getSpace(UnderFileSystemWithLogging.java:592)
	at alluxio.master.file.DefaultFileSystemMaster$Metrics.lambda$registerGauges$3(DefaultFileSystemMaster.java:4368)
	at alluxio.master.file.DefaultFileSystemMaster$TimeSeriesRecorder.heartbeat(DefaultFileSystemMaster.java:4137)
	at alluxio.heartbeat.HeartbeatThread.run(HeartbeatThread.java:119)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IllegalArgumentException: java.net.UnknownHostException: hdfs-k8s
	at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:378)
	at org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:310)
	at org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:176)
	at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:678)
	at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:619)
	at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:149)
	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
	at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
	at alluxio.underfs.hdfs.HdfsUnderFileSystem$1.load(HdfsUnderFileSystem.java:169)
	at alluxio.underfs.hdfs.HdfsUnderFileSystem$1.load(HdfsUnderFileSystem.java:155)
	at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3529)
	at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2278)
	at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2155)
	at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2045)
	... 17 more
Caused by: java.net.UnknownHostException: hdfs-k8s
	... 32 more
2021-03-08 08:10:42,013 INFO  BackupTracker - Resetting backup tracker.
2021-03-08 08:10:42,015 INFO  BackupLeaderRole - Creating backup-leader role.
2021-03-08 08:10:42,015 INFO  AbstractMaster - TableMaster: Starting primary master.
2021-03-08 08:10:42,017 INFO  AlluxioMasterProcess - All masters started
2021-03-08 08:10:42,022 INFO  MetricsSystem - Starting sinks with config: {}.
2021-03-08 08:10:42,022 INFO  AlluxioMasterProcess - Alluxio master web server version 2.4.1-1 starting (gained leadership). webAddress=/0.0.0.0:19999
2021-03-08 08:10:42,049 INFO  log - Logging initialized @68234ms to org.eclipse.jetty.util.log.Slf4jLog
2021-03-08 08:10:42,357 INFO  WebServer - Alluxio Master Web service starting @ /0.0.0.0:19999
2021-03-08 08:10:42,360 INFO  Server - jetty-9.4.31.v20200723; built: 2020-07-23T17:57:36.812Z; git: 450ba27947e13e66baa8cd1ce7e85a4461cacc1d; jvm 1.8.0_212-b04
2021-03-08 08:10:42,413 INFO  ContextHandler - Started o.e.j.s.ServletContextHandler@b4836c5{/metrics/prometheus,null,AVAILABLE}
2021-03-08 08:10:42,414 INFO  ContextHandler - Started o.e.j.s.ServletContextHandler@7048da95{/metrics/json,null,AVAILABLE}
2021-03-08 08:10:42,416 WARN  SecurityHandler - ServletContext@o.e.j.s.ServletContextHandler@5b1b1956{/,file:///opt/alluxio-2.4.1-1/webui/master/build/,STARTING} has uncovered http methods for path: /
2021-03-08 08:11:00,293 INFO  ContextHandler - Started o.e.j.s.ServletContextHandler@5b1b1956{/,file:///opt/alluxio-2.4.1-1/webui/master/build/,AVAILABLE}
2021-03-08 08:11:00,311 INFO  AbstractConnector - Started ServerConnector@4081b016{HTTP/1.1, (http/1.1)}{0.0.0.0:19999}
2021-03-08 08:11:00,311 INFO  Server - Started @86496ms
2021-03-08 08:11:00,311 INFO  WebServer - Alluxio Master Web service started @ /0.0.0.0:19999
2021-03-08 08:11:00,333 INFO  AlluxioMasterProcess - Alluxio master version 2.4.1-1 started (gained leadership). bindAddress=/0.0.0.0:19998, connectAddress=alluxio-master-1:19998, webAddress=/0.0.0.0:19999
2021-03-08 08:11:00,335 INFO  AlluxioMasterProcess - Starting Alluxio master gRPC server on address /0.0.0.0:19998
2021-03-08 08:11:00,502 INFO  MasterProcess - registered service METRICS_MASTER_CLIENT_SERVICE
2021-03-08 08:11:00,700 INFO  MasterProcess - registered service BLOCK_MASTER_CLIENT_SERVICE
2021-03-08 08:11:00,700 INFO  MasterProcess - registered service BLOCK_MASTER_WORKER_SERVICE
2021-03-08 08:11:01,696 INFO  MasterProcess - registered service FILE_SYSTEM_MASTER_JOB_SERVICE
2021-03-08 08:11:01,697 INFO  MasterProcess - registered service FILE_SYSTEM_MASTER_WORKER_SERVICE
2021-03-08 08:11:01,698 INFO  MasterProcess - registered service FILE_SYSTEM_MASTER_CLIENT_SERVICE
2021-03-08 08:11:01,842 INFO  MasterProcess - registered service META_MASTER_CONFIG_SERVICE
2021-03-08 08:11:01,842 INFO  MasterProcess - registered service META_MASTER_BACKUP_MESSAGING_SERVICE
2021-03-08 08:11:01,842 INFO  MasterProcess - registered service RAFT_JOURNAL_SERVICE
2021-03-08 08:11:01,842 INFO  MasterProcess - registered service META_MASTER_CLIENT_SERVICE
2021-03-08 08:11:01,842 INFO  MasterProcess - registered service META_MASTER_MASTER_SERVICE
2021-03-08 08:11:01,891 INFO  MasterProcess - registered service TABLE_MASTER_CLIENT_SERVICE
2021-03-08 08:11:01,963 INFO  DefaultSafeModeManager - Rpc server started, waiting 5000ms for workers to register
2021-03-08 08:11:01,964 INFO  AlluxioMasterProcess - Started Alluxio master gRPC server on address alluxio-master-1:19998
2021-03-08 08:11:01,972 INFO  FaultTolerantAlluxioMasterProcess - Primary started
2021-03-08 08:11:02,563 WARN  DefaultBlockMaster - Could not find worker id: 4512984809611543378 for heartbeat.
2021-03-08 08:11:02,628 INFO  DefaultBlockMaster - getWorkerId(): WorkerNetAddress: WorkerNetAddress{host=192.168.1.116, containerHost=172.31.141.184, rpcPort=29999, dataPort=29999, webPort=30000, domainSocketPath=, tieredIdentity=TieredIdentity(node=192.168.1.116, rack=null)} id: 7934004638968114946
2021-03-08 08:11:02,678 INFO  DefaultBlockMaster - registerWorker(): MasterWorkerInfo{id=7934004638968114946, workerAddress=WorkerNetAddress{host=192.168.1.116, containerHost=172.31.141.184, rpcPort=29999, dataPort=29999, webPort=30000, domainSocketPath=, tieredIdentity=TieredIdentity(node=192.168.1.116, rack=null)}, capacityBytes=34359738368, usedBytes=0, lastUpdatedTimeMs=1615191062678, blocks=[], lostStorage={}}
2021-03-08 08:11:02,807 INFO  DefaultMetaMaster - getMasterId(): MasterAddress: alluxio-master-0:19998 id: 2798209153424597338
2021-03-08 08:11:02,864 INFO  DefaultMetaMaster - registerMaster(): master: MasterInfo{id=2798209153424597338, address=alluxio-master-0:19998, lastUpdatedTimeMs=1615191062862}
2021-03-08 08:11:03,782 WARN  DefaultBlockMaster - Could not find worker id: 5061862824333568095 for heartbeat.
2021-03-08 08:11:03,801 INFO  DefaultBlockMaster - getWorkerId(): WorkerNetAddress: WorkerNetAddress{host=192.168.1.117, containerHost=172.31.228.236, rpcPort=29999, dataPort=29999, webPort=30000, domainSocketPath=, tieredIdentity=TieredIdentity(node=192.168.1.117, rack=null)} id: 951288178340284032
2021-03-08 08:11:03,816 INFO  DefaultBlockMaster - registerWorker(): MasterWorkerInfo{id=951288178340284032, workerAddress=WorkerNetAddress{host=192.168.1.117, containerHost=172.31.228.236, rpcPort=29999, dataPort=29999, webPort=30000, domainSocketPath=, tieredIdentity=TieredIdentity(node=192.168.1.117, rack=null)}, capacityBytes=34359738368, usedBytes=0, lastUpdatedTimeMs=1615191063815, blocks=[], lostStorage={}}
2021-03-08 08:11:04,544 INFO  DefaultMetaMaster - getMasterId(): MasterAddress: alluxio-master-2:19998 id: 6474051121844857814
2021-03-08 08:11:04,584 INFO  DefaultMetaMaster - registerMaster(): master: MasterInfo{id=6474051121844857814, address=alluxio-master-2:19998, lastUpdatedTimeMs=1615191064583}
2021-03-08 08:11:04,726 WARN  DefaultBlockMaster - Could not find worker id: 7799114670528034177 for heartbeat.
2021-03-08 08:11:04,747 INFO  DefaultBlockMaster - getWorkerId(): WorkerNetAddress: WorkerNetAddress{host=192.168.1.115, containerHost=172.31.134.240, rpcPort=29999, dataPort=29999, webPort=30000, domainSocketPath=, tieredIdentity=TieredIdentity(node=192.168.1.115, rack=null)} id: 243034216886158838
2021-03-08 08:11:04,764 INFO  DefaultBlockMaster - registerWorker(): MasterWorkerInfo{id=243034216886158838, workerAddress=WorkerNetAddress{host=192.168.1.115, containerHost=172.31.134.240, rpcPort=29999, dataPort=29999, webPort=30000, domainSocketPath=, tieredIdentity=TieredIdentity(node=192.168.1.115, rack=null)}, capacityBytes=34359738368, usedBytes=0, lastUpdatedTimeMs=1615191064763, blocks=[], lostStorage={}}
2021-03-08 08:11:04,981 WARN  DefaultBlockMaster - Could not find worker id: 8145672099464782622 for heartbeat.
2021-03-08 08:11:04,998 INFO  DefaultBlockMaster - getWorkerId(): WorkerNetAddress: WorkerNetAddress{host=192.168.1.118, containerHost=172.31.229.245, rpcPort=29999, dataPort=29999, webPort=30000, domainSocketPath=, tieredIdentity=TieredIdentity(node=192.168.1.118, rack=null)} id: 1427187059885272881
2021-03-08 08:11:05,035 INFO  DefaultBlockMaster - registerWorker(): MasterWorkerInfo{id=1427187059885272881, workerAddress=WorkerNetAddress{host=192.168.1.118, containerHost=172.31.229.245, rpcPort=29999, dataPort=29999, webPort=30000, domainSocketPath=, tieredIdentity=TieredIdentity(node=192.168.1.118, rack=null)}, capacityBytes=34359738368, usedBytes=0, lastUpdatedTimeMs=1615191065035, blocks=[], lostStorage={}}
2021-03-08 08:11:09,220 WARN  RestUtils - Unexpected error invoking rest endpoint: java.lang.IllegalArgumentException: java.net.UnknownHostException: hdfs-k8s
2021-03-08 08:11:24,145 WARN  RestUtils - Unexpected error invoking rest endpoint: java.lang.IllegalArgumentException: java.net.UnknownHostException: hdfs-k8s
2021-03-08 08:11:39,115 WARN  RestUtils - Unexpected error invoking rest endpoint: java.lang.IllegalArgumentException: java.net.UnknownHostException: hdfs-k8s
2021-03-08 08:11:54,117 WARN  RestUtils - Unexpected error invoking rest endpoint: java.lang.IllegalArgumentException: java.net.UnknownHostException: hdfs-k8s
2021-03-08 08:12:09,156 WARN  RestUtils - Unexpected error invoking rest endpoint: java.lang.IllegalArgumentException: java.net.UnknownHostException: hdfs-k8s
2021-03-08 08:12:24,115 WARN  RestUtils - Unexpected error invoking rest endpoint: java.lang.IllegalArgumentException: java.net.UnknownHostException: hdfs-k8s
2021-03-08 08:12:39,115 WARN  RestUtils - Unexpected error invoking rest endpoint: java.lang.IllegalArgumentException: java.net.UnknownHostException: hdfs-k8s
2021-03-08 08:12:54,149 WARN  RestUtils - Unexpected error invoking rest endpoint: java.lang.IllegalArgumentException: java.net.UnknownHostException: hdfs-k8s

To Reproduce: follow steps 1-3 in the bug description above.

Expected behavior: In Alluxio, I want to access HA HDFS through its nameservice.
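
For context, an HDFS client can resolve a logical nameservice such as hdfs-k8s only when the client-side HA properties are visible to it; otherwise the name is treated as a plain hostname and DNS resolution fails with UnknownHostException (the stack trace above goes through NameNodeProxies.createNonHAProxy, which is consistent with the HA configuration not being picked up). Below is a minimal sketch of those properties packaged as a Kubernetes Secret, in the shape the rest of this thread assumes; the namenode hostnames and ports are hypothetical, not the reporter's actual core-site.txt/hdfs-site.txt contents:

apiVersion: v1
kind: Secret
metadata:
  name: alluxio-hdfs-config           # same Secret name the initContainer example below mounts
stringData:
  core-site.xml: |
    <configuration>
      <!-- point the default filesystem at the logical nameservice, not a single namenode -->
      <property><name>fs.defaultFS</name><value>hdfs://hdfs-k8s</value></property>
    </configuration>
  hdfs-site.xml: |
    <configuration>
      <!-- declare the nameservice and its namenodes; hostnames/ports are hypothetical -->
      <property><name>dfs.nameservices</name><value>hdfs-k8s</value></property>
      <property><name>dfs.ha.namenodes.hdfs-k8s</name><value>nn0,nn1</value></property>
      <property>
        <name>dfs.namenode.rpc-address.hdfs-k8s.nn0</name>
        <value>hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020</value>
      </property>
      <property>
        <name>dfs.namenode.rpc-address.hdfs-k8s.nn1</name>
        <value>hdfs-namenode-1.hdfs-namenode.default.svc.cluster.local:8020</value>
      </property>
      <!-- client-side failover provider that teaches the HDFS client how to resolve hdfs-k8s -->
      <property>
        <name>dfs.client.failover.proxy.provider.hdfs-k8s</name>
        <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
      </property>
    </configuration>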

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 16 (10 by maintainers)

Top GitHub Comments

1 reaction
ZhuTopher commented, Mar 26, 2021

@gaozhenhai I just finished testing this out in my own environment, I’ll share the details of it with you at the end.

Secret Permissions

Regarding the permissions of your core-site.xml and hdfs-site.xml, it turns out Kubernetes does not support changing ownership of secret-mounted volumes. As a workaround, you can use an initContainer to copy the secrets into a separate volume which you can change the permissions for:

    spec:
      securityContext:
        fsGroup: 1000
      initContainers:
        - name: fix-permissions
          image: debian:buster-slim
          command: ["/bin/bash", "-c"]
          args: 
          - cp -RL /mnt/secrets/hdfsconfig/* /secrets/hdfsconfig;
            chown -R 1000:1000 /secrets/hdfsconfig/;
            chmod -R 755 /secrets/hdfsconfig;
            ls -l /secrets/hdfsconfig/;
          volumeMounts:
          - name: hdfs-secret
            mountPath: /secrets/hdfsconfig
          - name: hdfs-secret-mount
            mountPath: /mnt/secrets/hdfsconfig
          securityContext:
            runAsUser: 0
 ...
      volumes:
      - name: hdfs-secret
        emptyDir: {}
      - name: hdfs-secret-mount
        secret:
          secretName: alluxio-hdfs-config

Doing this you should see the following permissions in your pods’ containers:

$ kubectl exec -it alluxio-master-0 -c alluxio-master /bin/bash
bash-4.4$ ls -l /secrets
total 0
drwxr-sr-x    2 alluxio  alluxio         48 Mar 26 00:40 hdfsconfig
bash-4.4$ ls -l /secrets/hdfsconfig/
total 8
-rwxr-xr-x    1 alluxio  alluxio        493 Mar 26 00:40 core-site.xml
-rwxr-xr-x    1 alluxio  alluxio       2302 Mar 26 00:40 hdfs-site.xml

You should add this initContainer and volumes to both alluxio-master-statefulset.yaml and alluxio-worker-daemonset.yaml. You’ll also need to add the following volumeMount to both the main container and the ‘job’ container:

            volumeMounts:
            - name: hdfs-secret
              mountPath: /secrets/hdfsconfig

Alluxio Configuration Properties

Regarding alluxio-configmap.yaml, you'll need to change -Dalluxio.underfs.hdfs.configuration=/secrets/hdfsconfig/core-site.xml:/secrets/hdfsConfig/hdfs-site.xml to -Dalluxio.master.mount.table.root.option.alluxio.underfs.hdfs.configuration=/secrets/hdfsconfig/core-site.xml:/secrets/hdfsconfig/hdfs-site.xml (a sketch of the resulting ConfigMap entry follows the notes below):

  • We need to add the prefix alluxio.master.mount.table.root.option. to alluxio.underfs.hdfs.configuration
  • Also note the typo: /secrets/hdfsConfig should be /secrets/hdfsconfig
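
Here is a minimal sketch of what that corrected entry could look like. The ConfigMap name and the ALLUXIO_JAVA_OPTS key are assumptions based on the standard Alluxio Helm chart output, so adapt them to whatever your alluxio-configmap.yaml actually defines:

apiVersion: v1
kind: ConfigMap
metadata:
  name: alluxio-config                # hypothetical name; match what your chart generates
data:
  # Root UFS mounted at the HA nameservice, with the HDFS XMLs passed as a mount option
  ALLUXIO_JAVA_OPTS: >-
    -Dalluxio.master.mount.table.root.ufs=hdfs://hdfs-k8s/alluxio
    -Dalluxio.master.mount.table.root.option.alluxio.underfs.hdfs.configuration=/secrets/hdfsconfig/core-site.xml:/secrets/hdfsconfig/hdfs-site.xml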

Testing Environment

alluxio-hdfs-k8s.tar.gz

  • hdfs-k8s.yaml contains the HA Hadoop YAML manifests generated by Helm from this HDFS Helm chart (with some manual tweaks to fix typos)
    • config.yaml contains the values used to generate that Helm template via helm template r1 charts/hdfs-k8s -f config.yaml > hdfs-k8s.yaml
  • pvs/ contains some scratch PersistentVolume definitions for the HA Hadoop pods (a hypothetical example is sketched after the steps below)
  • alluxio/ contains the Alluxio YAML files derived from our Helm chart:
    • alluxio-configmap.yaml
    • alluxio-master-statefulset.yaml
    • alluxio-master-service.yaml
    • alluxio-worker-daemonset.yaml
    • secret.yaml is a Secret containing the base64-encoded HDFS XMLs
      • hdfs-site.xml
      • core-site.xml
  1. kubectl apply -f pvs/
  2. kubectl apply -f hdfs-k8s.yaml and wait for the Zookeeper -> Namenodes -> Datanodes to all be Running
  3. kubectl apply -f alluxio/ and wait for the Master and Worker Pods to be started
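
For reference, a scratch PersistentVolume of the kind kept in pvs/ could look like the following; this is a hypothetical hostPath example for a single test node, not the actual contents of the attached tarball:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: hdfs-scratch-pv-0             # hypothetical name; one PV per claim the chart creates
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Delete
  hostPath:
    path: /data/hdfs-scratch-0        # any empty directory on the node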

This set-up allowed Alluxio to connect to the HA HDFS nameservice for its UFS. Unfortunately, I wasn’t able to configure HDFS permissions properly to get Alluxio to persist files into HDFS, but it did successfully connect to the nameservice endpoint.

Conclusion

Let me know if this resolves your issue, thanks!

1 reaction
gaozhenhai commented, Mar 20, 2021

@gaozhenhai Something that was brought to my attention: can you kubectl exec into your Alluxio master Pod(s) and show the permissions of the mounted HDFS configs? e.g.:

$ kubectl -n gaozh exec -it alluxio-master-0 -c alluxio-master /bin/bash
# ls -l /secrets/

I suspect those will be owned by root:root and aren’t readable by the Alluxio process (which runs as 1000:1000). This is an issue with our Helm templates which we are fixing in #13061. In the meantime, you can adjust your alluxio-master-statefulset.yaml to contain the following:

spec:
  template:
    spec:
      securityContext:
        runAsUser: 1000
        runAsGroup: 1000
        fsGroup: 1000

Adding this change should allow the Secret passed as a Volume to be owned by 1000:1000. Let me know if this is the case or not and whether that solves your issue. Thanks!

@ZhuTopher The permissions for the mounted HDFS configuration are as follows: [screenshot attached in the original issue]

I updated the spec.template.spec.securityContext fields and restarted the pods, but I still get the error java.net.UnknownHostException: hdfs-k8s:

...
spec:
  selector:
    matchLabels:
      app: alluxio
      role: alluxio-master
      name: alluxio-master
  serviceName: alluxio-master
  replicas: 3
  template:
    metadata:
      labels:
        name: alluxio-master
        app: alluxio
        chart: alluxio-0.6.11
        release: alluxio
        heritage: Helm
        role: alluxio-master
    spec:
      hostNetwork: false
      dnsPolicy: ClusterFirst
      nodeSelector:
      securityContext:
        fsGroup: 1000
        runAsUser: 1000
        runAsGroup: 1000
...

I’m not sure if it’s a permission issue, because Alluxio doesn’t print any permission errors. You can use my YAML files and images to install an HA HDFS environment on Kubernetes to find the root cause of the problem.

Read more comments on GitHub >

