
Support download model through general WebHDFS REST API

See original GitHub issue

/kind feature

Describe the solution you’d like

We have a requirement to download files from our Hadoop-based storage using the WebHDFS REST API with TLS (we have to provide a TLS client certificate and a header carrying an ID). To make the feature general, TLS and headers can be optional.

User can use the following command to generate the secret:

kubectl create secret generic hdfscreds \
    --from-file=TLS_CERT=./client.crt \
    --from-file=TLS_KEY=./client.key \
    --from-literal=HDFS_NAMENODE="https://host:port" \
    --from-literal=HDFS_ROOTPATH="/" \
    --from-literal=HEADERS="{'x-my-container': 'my-container'}"

Format of the secret:

apiVersion: v1
kind: Secret
metadata:
  name: hdfscreds
type: Opaque
data:
  HDFS_NAMENODE: 'http(s)://host:port'
  HDFS_ROOTPATH: xxxx # string, default: "/"
  TLS_CERT: xxxx # string, default: ""
  TLS_KEY: xxxx # string, default: ""
  TLS_CA: xxxx # string, default: ""
  TLS_INSECURE_SKIP_VERIFY: xxxx # string (true|false), default: "false"
  HEADERS: xxxx # string, default: ""
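
Note that HEADERS is carried as a plain string, so the storage initializer would need to parse it into a dict before passing it to the session (the "parse from string" step in the sample below). A minimal parsing sketch, assuming the value is either strict JSON or the Python-style literal shown in the kubectl example above:

import ast
import json

def parse_headers(raw):
    # Parse the HEADERS value into a dict; an empty string means no extra headers.
    if not raw:
        return {}
    try:
        return json.loads(raw)        # strict JSON: {"x-my-container": "my-container"}
    except json.JSONDecodeError:
        return ast.literal_eval(raw)  # Python literal: {'x-my-container': 'my-container'}

# Example: parse_headers("{'x-my-container': 'my-container'}") -> {'x-my-container': 'my-container'}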

To implement this in the storage initializer, hdfscli is used.

A simple code sample:

from requests import Session
from hdfs.client import Client

HDFS_NAMENODE="https://host:port"
HDFS_ROOTPATH="/"
HEADERS={'x-my-container': 'my-container'} # parse from string
TLS_CERT="path to TLS_CERT"
TLS_KEY="path to TLS_KEY"
TLS_CA="path to TLS_CA"
TLS_INSECURE_SKIP_VERIFY = False # True or False
FILE_PATH = "path to file or directory"
OUT_PATH = "path to output directory"


s = Session()
s.cert = (TLS_CERT, TLS_KEY)
# s.verify can be True, False, or a path to a CA bundle
if TLS_CA:
  s.verify = TLS_CA
if TLS_INSECURE_SKIP_VERIFY:
  s.verify = False

s.headers.update(HEADERS)

# root has to be set to a non-empty value, otherwise the client calls GETHOMEDIRECTORY, which the user may not have permission for
client = Client(url=HDFS_NAMENODE, root=HDFS_ROOTPATH, session=s)
status = client.status(FILE_PATH, strict=False)
print(status) # None if the path does not exist

# Download a file or directory to the output directory
client.download(FILE_PATH, OUT_PATH, n_threads=2)

TLS_CERT and TLS_KEY, as well as TLS_CA, can be mounted as a volume (just like the GCS credentials) or written to a temp file if that is supported.
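
For the temp-file route, a rough sketch, assuming the certificate, key and CA contents are exposed to the initializer as environment variables with the names above (the helper name is illustrative):

import os
import tempfile
from requests import Session

def materialize(env_var):
    # Write the value of the environment variable to a temp file and return its path, or None if unset.
    value = os.getenv(env_var, "")
    if not value:
        return None
    tmp = tempfile.NamedTemporaryFile(mode="w", delete=False, suffix="_" + env_var)
    tmp.write(value)
    tmp.close()
    return tmp.name

cert_path = materialize("TLS_CERT")
key_path = materialize("TLS_KEY")
ca_path = materialize("TLS_CA")

s = Session()
if cert_path and key_path:
    s.cert = (cert_path, key_path)
if ca_path:
    s.verify = ca_path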

Anything else you would like to add:

https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/WebHDFS.html
https://hdfscli.readthedocs.io/en/latest/quickstart.html#python-bindings

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 6 (6 by maintainers)

Top GitHub Comments

1 reaction
markwinter commented, Apr 19, 2022

@lizzzcai Thanks. I noticed the same issue before as well. Below is how I currently have it implemented.

If the user gives an HDFS file, it downloads just that file into /mnt/models. If the user gives an HDFS directory, it lists that directory and downloads everything inside it to /mnt/models.

I think this will do the same as the code you gave.

from hdfs.ext.kerberos import KerberosClient
from hdfs.util import HdfsError

client = KerberosClient(namenode)

# Check path exists
# Raises HdfsError when path does not exist
client.status(path)

try:
    files = client.list(path)
    for f in files:
        client.download(f"{path}/{f}", out_dir, n_threads=2)
except HdfsError:
    # client.list raises exception when path is a file
    client.download(path, out_dir, n_threads=2)
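
A variant of the same logic, sketched here under the assumption that the FileStatus "type" field returned by WebHDFS is either FILE or DIRECTORY, branches on the status that is already fetched instead of using the exception for control flow:

status = client.status(path)  # raises HdfsError if the path does not exist
if status["type"] == "DIRECTORY":
    for f in client.list(path):
        client.download(f"{path}/{f}", out_dir, n_threads=2)
else:
    client.download(path, out_dir, n_threads=2)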
1 reaction
lizzzcai commented, Apr 13, 2022

> @lizzzcai Thanks for clarifying. I’ll start working on this this week. I will probably add it to my existing PR

Thanks @markwinter , that is very good news. Adding to the existing PR should be fine. Please let me know if you need any help.
