Support download model through general WebHDFS REST API
/kind feature
**Describe the solution you'd like**

We have a requirement to download files from our Hadoop-based storage using the WebHDFS REST API with TLS (we have to provide a TLS client certificate and an ID header). To keep it general, TLS and headers should be optional.
Users can generate the secret with the following command:

```shell
kubectl create secret generic hdfscreds \
  --from-file=TLS_CERT=./client.crt \
  --from-file=TLS_KEY=./client.key \
  --from-literal=HDFS_NAMENODE="https://host:port" \
  --from-literal=HDFS_ROOTPATH="/" \
  --from-literal=HEADERS="{'x-my-container': 'my-container'}"
```
Format of the secret:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: hdfscreds
type: Opaque
data:
  HDFS_NAMENODE: 'http(s)://host:port'
  HDFS_ROOTPATH: xxxx              # string, default: "/"
  TLS_CERT: xxxx                   # string, default: ""
  TLS_KEY: xxxx                    # string, default: ""
  TLS_CA: xxxx                     # string, default: ""
  TLS_INSECURE_SKIP_VERIFY: xxxx   # string (true|false), default: "false"
  HEADERS: xxxx                    # string, default: ""
```
To implement this in the storage initializer, the hdfscli library is used. A simple code sample:

```python
from requests import Session
from hdfs.client import Client

# Values parsed from the secret
HDFS_NAMENODE = "https://host:port"
HDFS_ROOTPATH = "/"
HEADERS = {'x-my-container': 'my-container'}  # parsed from the HEADERS string
TLS_CERT = "path to TLS_CERT"
TLS_KEY = "path to TLS_KEY"
TLS_CA = "path to TLS_CA"
TLS_INSECURE_SKIP_VERIFY = False  # True or False
FILE_PATH = "path to file or directory"
OUT_PATH = "path to output directory"

s = Session()
s.cert = (TLS_CERT, TLS_KEY)
# s.verify can be True, False, or a CA path
if TLS_CA:
    s.verify = TLS_CA
if TLS_INSECURE_SKIP_VERIFY:
    s.verify = False
s.headers.update(HEADERS)

# root has to be set to a non-empty value, otherwise the client issues a
# GETHOMEDIRECTORY call that the user doesn't have permission for
client = Client(url=HDFS_NAMENODE, root=HDFS_ROOTPATH, session=s)

status = client.status(FILE_PATH, strict=False)
print(status)  # None if the path does not exist

# Download the file or directory to the output directory
client.download(FILE_PATH, OUT_PATH, n_threads=2)
```
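The `HEADERS` value arrives as a string and has to be parsed into a dict before being passed to the session. Since the secret example stores it as a Python-style dict literal (single quotes, so not valid JSON), `ast.literal_eval` is one safe way to parse it; a minimal sketch (the function name is mine, not part of any existing API):

```python
import ast

def parse_headers(raw):
    """Parse the HEADERS secret value into a dict.

    The secret stores headers as a dict-literal string such as
    "{'x-my-container': 'my-container'}". ast.literal_eval evaluates
    only literals, so it cannot execute arbitrary code. Returns {}
    for an empty or unset value.
    """
    if not raw:
        return {}
    headers = ast.literal_eval(raw)
    if not isinstance(headers, dict):
        raise ValueError("HEADERS must be a dict literal")
    return headers
```

The result can then be passed directly to `s.headers.update(...)` in the sample above.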
The TLS_CERT and TLS_KEY, as well as TLS_CA, can be mounted as a volume, just like the GCS credentials, or written to temporary files if that is supported.
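If the temp-file route is taken, the PEM contents could be materialized from environment variables like this. This is only a sketch under the assumption that the secret keys are exposed as environment variables holding the PEM text itself (the function name is mine):

```python
import os
import tempfile

def materialize_tls_files():
    """Write PEM contents from env vars to temp files for requests.

    Assumes TLS_CERT / TLS_KEY / TLS_CA env vars (hypothetical,
    mirroring the secret keys) contain the PEM text. Returns a tuple
    (cert_path, key_path, ca_path); an entry is None when the
    corresponding variable is empty or unset.
    """
    paths = []
    for var in ("TLS_CERT", "TLS_KEY", "TLS_CA"):
        pem = os.environ.get(var, "")
        if not pem:
            paths.append(None)
            continue
        # delete=False so the file survives after close(); requests
        # reads it lazily when the session makes a request
        f = tempfile.NamedTemporaryFile(mode="w", suffix=".pem", delete=False)
        f.write(pem)
        f.close()
        paths.append(f.name)
    return tuple(paths)
```

The returned paths would then feed `s.cert` and `s.verify` exactly as in the sample above.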
**Anything else you would like to add:**
- https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/WebHDFS.html
- https://hdfscli.readthedocs.io/en/latest/quickstart.html#python-bindings
Issue Analytics
- State:
- Created a year ago
- Comments: 6 (6 by maintainers)
Top GitHub Comments
@lizzzcai Thanks. I noticed the same issue before as well. Below is how I currently have it implemented.
If the user gives an hdfs file, it downloads just the file into `/mnt/models`. If the user gives an hdfs directory, then it lists that directory and downloads everything inside to `/mnt/models`.
I think this will do the same as the code you gave.
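The file-versus-directory branching described in that comment can be sketched roughly as follows. This is not the actual PR code, just an illustration; the function name and return value are mine, and only `status`, `list`, and `download` from hdfscli's `Client` are assumed (the WebHDFS FileStatus "type" field is "FILE" or "DIRECTORY"):

```python
MODEL_DIR = "/mnt/models"  # KServe's conventional model mount point

def download_to_model_dir(client, path, out_dir=MODEL_DIR):
    """Download a single hdfs file, or every entry of an hdfs
    directory, into out_dir.

    `client` is expected to behave like hdfs.client.Client; only
    status(), list(), and download() are used. Returns the list of
    hdfs paths that were downloaded.
    """
    status = client.status(path, strict=False)
    if status is None:
        raise FileNotFoundError(path + " does not exist on HDFS")
    if status["type"] == "FILE":
        client.download(path, out_dir, overwrite=True)
        return [path]
    downloaded = []
    for name in client.list(path):
        src = path.rstrip("/") + "/" + name
        client.download(src, out_dir, overwrite=True, n_threads=2)
        downloaded.append(src)
    return downloaded
```

Note that `client.download` can also recurse into directories by itself; listing first, as the comment describes, has the advantage of placing the directory's contents directly under `/mnt/models` rather than inside a nested subdirectory.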
Thanks @markwinter , that is very good news. Adding to the existing PR should be fine. Please let me know if you need any help.