Unable to resolve nameservices for HA HDFS when HA HDFS is deployed in Kubernetes
See original GitHub issue
Alluxio Version: 2.4.1-1
Describe the bug
1. High-availability HDFS is deployed in Kubernetes with the following configuration files: core-site.txt, hdfs-site.txt
2. Alluxio starts the master using a ConfigMap as shown below:
alluxio-configmap.txt
-Dalluxio.master.mount.table.root.ufs=hdfs://hdfs-k8s/alluxio
3. But the Alluxio master fails with an error: java.net.UnknownHostException: hdfs-k8s
2021-03-08 08:10:20,883 WARN MetricRegistriesImpl - First MetricRegistry has been created without registering reporters. You may need to call MetricRegistries.global().addReportRegistration(...) before.
2021-03-08 08:10:20,885 INFO RaftJournalSystem - Performing catchup. Last applied SN: 4. Catchup ID: -3575425480893704492
2021-03-08 08:10:20,886 INFO RaftServerConfigKeys - raft.server.write.element-limit = 4096 (default)
2021-03-08 08:10:20,887 INFO RaftServerConfigKeys - raft.server.write.byte-limit = 167772160 (custom)
2021-03-08 08:10:20,890 INFO RaftJournalSystem - Exception submitting term start entry: java.util.concurrent.ExecutionException: org.apache.ratis.protocol.LeaderNotReadyException: alluxio-master-1_19200@group-ABB3109A44C1 is in LEADER state but not ready yet.
2021-03-08 08:10:20,894 INFO RaftServerConfigKeys - raft.server.watch.timeout = 10s (default)
2021-03-08 08:10:20,895 INFO RaftServerConfigKeys - raft.server.watch.timeout.denomination = 1s (default)
2021-03-08 08:10:20,896 INFO RaftServerConfigKeys - raft.server.watch.element-limit = 65536 (default)
2021-03-08 08:10:20,908 INFO RaftServerConfigKeys - raft.server.log.appender.snapshot.chunk.size.max = 16MB (=16777216) (default)
2021-03-08 08:10:20,908 INFO RaftServerConfigKeys - raft.server.log.appender.buffer.byte-limit = 10485760 (custom)
2021-03-08 08:10:20,908 INFO RaftServerConfigKeys - raft.server.log.appender.buffer.element-limit = 0 (default)
2021-03-08 08:10:20,913 INFO GrpcConfigKeys - raft.grpc.server.leader.outstanding.appends.max = 128 (default)
2021-03-08 08:10:20,913 INFO RaftServerConfigKeys - raft.server.rpc.request.timeout = 5000ms (custom)
2021-03-08 08:10:20,914 INFO RaftServerConfigKeys - raft.server.log.appender.install.snapshot.enabled = false (custom)
2021-03-08 08:10:20,914 INFO RatisMetrics - Creating Metrics Registry : ratis_grpc.log_appender.alluxio-master-1_19200@group-ABB3109A44C1
2021-03-08 08:10:20,914 WARN MetricRegistriesImpl - First MetricRegistry has been created without registering reporters. You may need to call MetricRegistries.global().addReportRegistration(...) before.
2021-03-08 08:10:20,919 INFO RaftServerConfigKeys - raft.server.log.appender.snapshot.chunk.size.max = 16MB (=16777216) (default)
2021-03-08 08:10:20,919 INFO RaftServerConfigKeys - raft.server.log.appender.buffer.byte-limit = 10485760 (custom)
2021-03-08 08:10:20,919 INFO RaftServerConfigKeys - raft.server.log.appender.buffer.element-limit = 0 (default)
2021-03-08 08:10:20,919 INFO GrpcConfigKeys - raft.grpc.server.leader.outstanding.appends.max = 128 (default)
2021-03-08 08:10:20,920 INFO RaftServerConfigKeys - raft.server.rpc.request.timeout = 5000ms (custom)
2021-03-08 08:10:20,920 INFO RaftServerConfigKeys - raft.server.log.appender.install.snapshot.enabled = false (custom)
2021-03-08 08:10:20,922 INFO RoleInfo - alluxio-master-1_19200: start LeaderState
2021-03-08 08:10:20,936 INFO SegmentedRaftLogWorker - alluxio-master-1_19200@group-ABB3109A44C1-SegmentedRaftLogWorker: Rolling segment log-47_50 to index:50
2021-03-08 08:10:20,946 INFO SegmentedRaftLogWorker - alluxio-master-1_19200@group-ABB3109A44C1-SegmentedRaftLogWorker: Rolled log segment from /journal/raft/02511d47-d67c-49a3-9011-abb3109a44c1/current/log_inprogress_47 to /journal/raft/02511d47-d67c-49a3-9011-abb3109a44c1/current/log_47-50
2021-03-08 08:10:21,090 INFO SegmentedRaftLogWorker - alluxio-master-1_19200@group-ABB3109A44C1-SegmentedRaftLogWorker: created new log segment /journal/raft/02511d47-d67c-49a3-9011-abb3109a44c1/current/log_inprogress_51
2021-03-08 08:10:21,890 INFO RaftJournalSystem - Performing catchup. Last applied SN: 4. Catchup ID: -7761596533413317960
2021-03-08 08:10:41,917 INFO RaftJournalSystem - Caught up in 21032ms. Last sequence number from previous term: 4.
2021-03-08 08:10:41,923 INFO AbstractMaster - MetricsMaster: Starting primary master.
2021-03-08 08:10:41,925 INFO MetricsSystem - Reset all metrics in the metrics system in 1ms
2021-03-08 08:10:41,925 INFO MetricsStore - Cleared the metrics store and metrics system in 1 ms
2021-03-08 08:10:41,926 INFO AbstractMaster - BlockMaster: Starting primary master.
2021-03-08 08:10:41,927 INFO AbstractMaster - FileSystemMaster: Starting primary master.
2021-03-08 08:10:41,928 INFO DefaultFileSystemMaster - Starting fs master as primary
2021-03-08 08:10:41,948 INFO AbstractMaster - MetaMaster: Starting primary master.
2021-03-08 08:10:41,971 INFO DefaultMetaMaster - Detected existing cluster ID 0efda228-6f86-4bb4-b467-3dc68899d970
2021-03-08 08:10:41,998 ERROR HeartbeatThread - Uncaught exception in heartbeat executor, Heartbeat Thread shutting down
com.google.common.util.concurrent.UncheckedExecutionException: java.lang.IllegalArgumentException: java.net.UnknownHostException: hdfs-k8s
at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2051)
at com.google.common.cache.LocalCache.get(LocalCache.java:3951)
at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3974)
at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4958)
at alluxio.underfs.hdfs.HdfsUnderFileSystem.getFs(HdfsUnderFileSystem.java:811)
at alluxio.underfs.hdfs.HdfsUnderFileSystem.getSpace(HdfsUnderFileSystem.java:388)
at alluxio.underfs.UnderFileSystemWithLogging$26.call(UnderFileSystemWithLogging.java:595)
at alluxio.underfs.UnderFileSystemWithLogging$26.call(UnderFileSystemWithLogging.java:592)
at alluxio.underfs.UnderFileSystemWithLogging.call(UnderFileSystemWithLogging.java:1208)
at alluxio.underfs.UnderFileSystemWithLogging.getSpace(UnderFileSystemWithLogging.java:592)
at alluxio.master.file.DefaultFileSystemMaster$Metrics.lambda$registerGauges$3(DefaultFileSystemMaster.java:4368)
at alluxio.master.file.DefaultFileSystemMaster$TimeSeriesRecorder.heartbeat(DefaultFileSystemMaster.java:4137)
at alluxio.heartbeat.HeartbeatThread.run(HeartbeatThread.java:119)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IllegalArgumentException: java.net.UnknownHostException: hdfs-k8s
at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:378)
at org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:310)
at org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:176)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:678)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:619)
at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:149)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at alluxio.underfs.hdfs.HdfsUnderFileSystem$1.load(HdfsUnderFileSystem.java:169)
at alluxio.underfs.hdfs.HdfsUnderFileSystem$1.load(HdfsUnderFileSystem.java:155)
at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3529)
at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2278)
at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2155)
at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2045)
... 17 more
Caused by: java.net.UnknownHostException: hdfs-k8s
... 32 more
2021-03-08 08:10:42,013 INFO BackupTracker - Resetting backup tracker.
2021-03-08 08:10:42,015 INFO BackupLeaderRole - Creating backup-leader role.
2021-03-08 08:10:42,015 INFO AbstractMaster - TableMaster: Starting primary master.
2021-03-08 08:10:42,017 INFO AlluxioMasterProcess - All masters started
2021-03-08 08:10:42,022 INFO MetricsSystem - Starting sinks with config: {}.
2021-03-08 08:10:42,022 INFO AlluxioMasterProcess - Alluxio master web server version 2.4.1-1 starting (gained leadership). webAddress=/0.0.0.0:19999
2021-03-08 08:10:42,049 INFO log - Logging initialized @68234ms to org.eclipse.jetty.util.log.Slf4jLog
2021-03-08 08:10:42,357 INFO WebServer - Alluxio Master Web service starting @ /0.0.0.0:19999
2021-03-08 08:10:42,360 INFO Server - jetty-9.4.31.v20200723; built: 2020-07-23T17:57:36.812Z; git: 450ba27947e13e66baa8cd1ce7e85a4461cacc1d; jvm 1.8.0_212-b04
2021-03-08 08:10:42,413 INFO ContextHandler - Started o.e.j.s.ServletContextHandler@b4836c5{/metrics/prometheus,null,AVAILABLE}
2021-03-08 08:10:42,414 INFO ContextHandler - Started o.e.j.s.ServletContextHandler@7048da95{/metrics/json,null,AVAILABLE}
2021-03-08 08:10:42,416 WARN SecurityHandler - ServletContext@o.e.j.s.ServletContextHandler@5b1b1956{/,file:///opt/alluxio-2.4.1-1/webui/master/build/,STARTING} has uncovered http methods for path: /
2021-03-08 08:11:00,293 INFO ContextHandler - Started o.e.j.s.ServletContextHandler@5b1b1956{/,file:///opt/alluxio-2.4.1-1/webui/master/build/,AVAILABLE}
2021-03-08 08:11:00,311 INFO AbstractConnector - Started ServerConnector@4081b016{HTTP/1.1, (http/1.1)}{0.0.0.0:19999}
2021-03-08 08:11:00,311 INFO Server - Started @86496ms
2021-03-08 08:11:00,311 INFO WebServer - Alluxio Master Web service started @ /0.0.0.0:19999
2021-03-08 08:11:00,333 INFO AlluxioMasterProcess - Alluxio master version 2.4.1-1 started (gained leadership). bindAddress=/0.0.0.0:19998, connectAddress=alluxio-master-1:19998, webAddress=/0.0.0.0:19999
2021-03-08 08:11:00,335 INFO AlluxioMasterProcess - Starting Alluxio master gRPC server on address /0.0.0.0:19998
2021-03-08 08:11:00,502 INFO MasterProcess - registered service METRICS_MASTER_CLIENT_SERVICE
2021-03-08 08:11:00,700 INFO MasterProcess - registered service BLOCK_MASTER_CLIENT_SERVICE
2021-03-08 08:11:00,700 INFO MasterProcess - registered service BLOCK_MASTER_WORKER_SERVICE
2021-03-08 08:11:01,696 INFO MasterProcess - registered service FILE_SYSTEM_MASTER_JOB_SERVICE
2021-03-08 08:11:01,697 INFO MasterProcess - registered service FILE_SYSTEM_MASTER_WORKER_SERVICE
2021-03-08 08:11:01,698 INFO MasterProcess - registered service FILE_SYSTEM_MASTER_CLIENT_SERVICE
2021-03-08 08:11:01,842 INFO MasterProcess - registered service META_MASTER_CONFIG_SERVICE
2021-03-08 08:11:01,842 INFO MasterProcess - registered service META_MASTER_BACKUP_MESSAGING_SERVICE
2021-03-08 08:11:01,842 INFO MasterProcess - registered service RAFT_JOURNAL_SERVICE
2021-03-08 08:11:01,842 INFO MasterProcess - registered service META_MASTER_CLIENT_SERVICE
2021-03-08 08:11:01,842 INFO MasterProcess - registered service META_MASTER_MASTER_SERVICE
2021-03-08 08:11:01,891 INFO MasterProcess - registered service TABLE_MASTER_CLIENT_SERVICE
2021-03-08 08:11:01,963 INFO DefaultSafeModeManager - Rpc server started, waiting 5000ms for workers to register
2021-03-08 08:11:01,964 INFO AlluxioMasterProcess - Started Alluxio master gRPC server on address alluxio-master-1:19998
2021-03-08 08:11:01,972 INFO FaultTolerantAlluxioMasterProcess - Primary started
2021-03-08 08:11:02,563 WARN DefaultBlockMaster - Could not find worker id: 4512984809611543378 for heartbeat.
2021-03-08 08:11:02,628 INFO DefaultBlockMaster - getWorkerId(): WorkerNetAddress: WorkerNetAddress{host=192.168.1.116, containerHost=172.31.141.184, rpcPort=29999, dataPort=29999, webPort=30000, domainSocketPath=, tieredIdentity=TieredIdentity(node=192.168.1.116, rack=null)} id: 7934004638968114946
2021-03-08 08:11:02,678 INFO DefaultBlockMaster - registerWorker(): MasterWorkerInfo{id=7934004638968114946, workerAddress=WorkerNetAddress{host=192.168.1.116, containerHost=172.31.141.184, rpcPort=29999, dataPort=29999, webPort=30000, domainSocketPath=, tieredIdentity=TieredIdentity(node=192.168.1.116, rack=null)}, capacityBytes=34359738368, usedBytes=0, lastUpdatedTimeMs=1615191062678, blocks=[], lostStorage={}}
2021-03-08 08:11:02,807 INFO DefaultMetaMaster - getMasterId(): MasterAddress: alluxio-master-0:19998 id: 2798209153424597338
2021-03-08 08:11:02,864 INFO DefaultMetaMaster - registerMaster(): master: MasterInfo{id=2798209153424597338, address=alluxio-master-0:19998, lastUpdatedTimeMs=1615191062862}
2021-03-08 08:11:03,782 WARN DefaultBlockMaster - Could not find worker id: 5061862824333568095 for heartbeat.
2021-03-08 08:11:03,801 INFO DefaultBlockMaster - getWorkerId(): WorkerNetAddress: WorkerNetAddress{host=192.168.1.117, containerHost=172.31.228.236, rpcPort=29999, dataPort=29999, webPort=30000, domainSocketPath=, tieredIdentity=TieredIdentity(node=192.168.1.117, rack=null)} id: 951288178340284032
2021-03-08 08:11:03,816 INFO DefaultBlockMaster - registerWorker(): MasterWorkerInfo{id=951288178340284032, workerAddress=WorkerNetAddress{host=192.168.1.117, containerHost=172.31.228.236, rpcPort=29999, dataPort=29999, webPort=30000, domainSocketPath=, tieredIdentity=TieredIdentity(node=192.168.1.117, rack=null)}, capacityBytes=34359738368, usedBytes=0, lastUpdatedTimeMs=1615191063815, blocks=[], lostStorage={}}
2021-03-08 08:11:04,544 INFO DefaultMetaMaster - getMasterId(): MasterAddress: alluxio-master-2:19998 id: 6474051121844857814
2021-03-08 08:11:04,584 INFO DefaultMetaMaster - registerMaster(): master: MasterInfo{id=6474051121844857814, address=alluxio-master-2:19998, lastUpdatedTimeMs=1615191064583}
2021-03-08 08:11:04,726 WARN DefaultBlockMaster - Could not find worker id: 7799114670528034177 for heartbeat.
2021-03-08 08:11:04,747 INFO DefaultBlockMaster - getWorkerId(): WorkerNetAddress: WorkerNetAddress{host=192.168.1.115, containerHost=172.31.134.240, rpcPort=29999, dataPort=29999, webPort=30000, domainSocketPath=, tieredIdentity=TieredIdentity(node=192.168.1.115, rack=null)} id: 243034216886158838
2021-03-08 08:11:04,764 INFO DefaultBlockMaster - registerWorker(): MasterWorkerInfo{id=243034216886158838, workerAddress=WorkerNetAddress{host=192.168.1.115, containerHost=172.31.134.240, rpcPort=29999, dataPort=29999, webPort=30000, domainSocketPath=, tieredIdentity=TieredIdentity(node=192.168.1.115, rack=null)}, capacityBytes=34359738368, usedBytes=0, lastUpdatedTimeMs=1615191064763, blocks=[], lostStorage={}}
2021-03-08 08:11:04,981 WARN DefaultBlockMaster - Could not find worker id: 8145672099464782622 for heartbeat.
2021-03-08 08:11:04,998 INFO DefaultBlockMaster - getWorkerId(): WorkerNetAddress: WorkerNetAddress{host=192.168.1.118, containerHost=172.31.229.245, rpcPort=29999, dataPort=29999, webPort=30000, domainSocketPath=, tieredIdentity=TieredIdentity(node=192.168.1.118, rack=null)} id: 1427187059885272881
2021-03-08 08:11:05,035 INFO DefaultBlockMaster - registerWorker(): MasterWorkerInfo{id=1427187059885272881, workerAddress=WorkerNetAddress{host=192.168.1.118, containerHost=172.31.229.245, rpcPort=29999, dataPort=29999, webPort=30000, domainSocketPath=, tieredIdentity=TieredIdentity(node=192.168.1.118, rack=null)}, capacityBytes=34359738368, usedBytes=0, lastUpdatedTimeMs=1615191065035, blocks=[], lostStorage={}}
2021-03-08 08:11:09,220 WARN RestUtils - Unexpected error invoking rest endpoint: java.lang.IllegalArgumentException: java.net.UnknownHostException: hdfs-k8s
2021-03-08 08:11:24,145 WARN RestUtils - Unexpected error invoking rest endpoint: java.lang.IllegalArgumentException: java.net.UnknownHostException: hdfs-k8s
2021-03-08 08:11:39,115 WARN RestUtils - Unexpected error invoking rest endpoint: java.lang.IllegalArgumentException: java.net.UnknownHostException: hdfs-k8s
2021-03-08 08:11:54,117 WARN RestUtils - Unexpected error invoking rest endpoint: java.lang.IllegalArgumentException: java.net.UnknownHostException: hdfs-k8s
2021-03-08 08:12:09,156 WARN RestUtils - Unexpected error invoking rest endpoint: java.lang.IllegalArgumentException: java.net.UnknownHostException: hdfs-k8s
2021-03-08 08:12:24,115 WARN RestUtils - Unexpected error invoking rest endpoint: java.lang.IllegalArgumentException: java.net.UnknownHostException: hdfs-k8s
2021-03-08 08:12:39,115 WARN RestUtils - Unexpected error invoking rest endpoint: java.lang.IllegalArgumentException: java.net.UnknownHostException: hdfs-k8s
2021-03-08 08:12:54,149 WARN RestUtils - Unexpected error invoking rest endpoint: java.lang.IllegalArgumentException: java.net.UnknownHostException: hdfs-k8s
To Reproduce
See the deployment steps in the description above.
Expected behavior
In Alluxio, I want to access HA HDFS through its nameservice.
Urgency
Additional context
Top GitHub Comments
@gaozhenhai I just finished testing this out in my own environment; I’ll share the details with you at the end.
Secret Permissions
Regarding the permissions of your core-site.xml and hdfs-site.xml, it turns out Kubernetes does not support changing ownership of secret-mounted volumes. As a workaround, you can use an initContainer to copy the secrets into a separate volume whose permissions you can change (a sketch of such an initContainer and volume follows below). With that in place, the files in your pods' containers should show the expected ownership and permissions.
You should add this initContainer and volumes to both alluxio-master-statefulset.yaml and alluxio-worker-daemonset.yaml. You'll also need to add the corresponding volumeMount to both the main container and the 'job' container (a sketch follows below).
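A minimal sketch of that volumeMount, under the same naming assumptions as above (the volume name hdfs-config and the mount path /secrets/hdfsconfig are not taken from the original comment):

```yaml
# Hypothetical excerpt: add to the Alluxio master/worker container and to the 'job' container.
volumeMounts:
  - name: hdfs-config
    mountPath: /secrets/hdfsconfig
    readOnly: true
```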
Alluxio Configuration Properties
Regarding alluxio-configmap.yaml, you'll need to change
-Dalluxio.underfs.hdfs.configuration=/secrets/hdfsconfig/core-site.xml:/secrets/hdfsConfig/hdfs-site.xml
to
-Dalluxio.master.mount.table.root.option.alluxio.underfs.hdfs.configuration=/secrets/hdfsconfig/core-site.xml:/secrets/hdfsconfig/hdfs-site.xml
Note the two differences:
- alluxio.master.mount.table.root.option. is prepended to alluxio.underfs.hdfs.configuration
- /secrets/hdfsConfig is corrected to /secrets/hdfsconfig
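For concreteness, here is a hedged sketch of how that could look inside alluxio-configmap.yaml. The ConfigMap name and the use of ALLUXIO_JAVA_OPTS are assumptions based on the standard Alluxio Kubernetes templates, not a copy of the commenter's file:

```yaml
# Hypothetical excerpt of alluxio-configmap.yaml; only the options relevant to this issue are shown.
apiVersion: v1
kind: ConfigMap
metadata:
  name: alluxio-config
data:
  ALLUXIO_JAVA_OPTS: >-
    -Dalluxio.master.mount.table.root.ufs=hdfs://hdfs-k8s/alluxio
    -Dalluxio.master.mount.table.root.option.alluxio.underfs.hdfs.configuration=/secrets/hdfsconfig/core-site.xml:/secrets/hdfsconfig/hdfs-site.xml
```

With the option scoped under alluxio.master.mount.table.root.option., the HDFS client created for the root UFS mount reads the HA nameservice definition from those XML files instead of trying to resolve hdfs-k8s as a plain hostname, which is what produces the UnknownHostException.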
Testing Environment
alluxio-hdfs-k8s.tar.gz
- hdfs-k8s.yaml is the HA Hadoop YAML generated by Helm from this HDFS Helm chart (with some manual tweaks to fix typos)
- config.yaml is the values file used to generate that Helm template via helm template r1 charts/hdfs-k8s -f config.yaml > hdfs-k8s.yaml
- pvs/ contains some scratch PersistentVolume definitions for the HA Hadoop pods
- alluxio/ contains the Alluxio YAML files derived from our Helm chart: alluxio-configmap.yaml, alluxio-master-statefulset.yaml, alluxio-master-service.yaml, alluxio-worker-daemonset.yaml
- secret.yaml is a Secret containing the base64-encoded HDFS XMLs hdfs-site.xml and core-site.xml (see the sketch after this list)
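A minimal sketch of what such a Secret could look like. The name hdfs-config and the use of stringData (instead of pre-encoded base64 data) are assumptions for readability, and the XML bodies are elided because the original attachments are not reproduced here:

```yaml
# Hypothetical Secret carrying the HDFS client configuration mounted into the Alluxio pods.
apiVersion: v1
kind: Secret
metadata:
  name: hdfs-config
type: Opaque
stringData:
  core-site.xml: |
    <!-- contents of core-site.xml, including fs.defaultFS=hdfs://hdfs-k8s -->
  hdfs-site.xml: |
    <!-- contents of hdfs-site.xml, including the dfs.nameservices / dfs.ha.namenodes.* entries -->
```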
To deploy:
1. kubectl apply -f pvs/
2. kubectl apply -f hdfs-k8s.yaml and wait for the Zookeeper -> Namenodes -> Datanodes to all be Running
3. kubectl apply -f alluxio/ and wait for the Master and Worker Pods to be started

For me, this set-up allowed Alluxio to connect to the HA HDFS nameservice for its UFS. Unfortunately I wasn't able to configure HDFS permissions properly to get Alluxio to persist files into HDFS, but it was successfully able to connect to the nameservice endpoint.
Conclusion
Let me know if this resolves your issue, thanks!
@ZhuTopher The permissions for the mounted HDFS configuration are as follows
I updated the spec.template.spec.securityContext fields and restarted the pod, but I still get the error: java.net.UnknownHostException: hdfs-k8s.
I'm not sure if it's a permission issue, because Alluxio doesn't print any permission errors. You can use my YAML files and images to install an HA HDFS environment on Kubernetes to find the root cause of the problem.
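For reference, a hedged sketch of the kind of securityContext change described here. The specific uid/gid values are assumptions, not taken from this thread, and as noted above this change alone did not resolve the UnknownHostException:

```yaml
# Hypothetical excerpt of alluxio-master-statefulset.yaml showing a pod-level securityContext.
spec:
  template:
    spec:
      securityContext:
        runAsUser: 1000   # assumed Alluxio uid
        runAsGroup: 1000  # assumed Alluxio gid
        fsGroup: 1000     # makes mounted volumes group-accessible to this gid
```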