Separate bindings from Pravega
Problem description
This issue tracks the initiative to separate the bindings from the Pravega source.
Problem location
The bindings source will be moved from project('bindings') to its own repository.
Suggestions for an improvement
A draft of the corresponding PDP is included below.
Associated code changes: https://github.com/pravega/pravega-bindings/pull/1 https://github.com/pravega/pravega/pull/4908
Status: Draft
Summary
This PDP proposes that the S3, HDFS, and file-system bindings no longer be built with Pravega, because they incur significant technical debt. Once implemented, this proposal will allow the bindings to be discovered at runtime and used independently. The project('bindings') entry will be removed from build.gradle, and the bindings will live in their own repository.
Shortcomings of the current implementation
- External dependency. The current Bindings implementation is at the mercy of all defects in external service providers such as HDFS, extended S3, and file-system implementations. These providers frequently change the methods and properties in their APIs, causing breaking changes in Pravega, even though the bindings are not always required for the stream store to run.
- Incorrect usage. The bindings are usually loaded to serve the standalone deployment mode, but they are subject to misuse. For example, the HDFS binding can be used to launch a distributed file system cluster over ephemeral storage within a single node; such deployments should never reach production.
Key Design Changes
Proposal
Below is the summary of this proposal.
Step 1: Implement the bindings in a separate repository.
- The bindings project will be completely removed from the Pravega source. It provides the implementations of SyncStorage, which will be resolved dynamically (a registration sketch follows this step).
- It will be hosted in a separate repository where it can be serviced more frequently, as and when the external dependencies provide updates.
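For illustration, a binding in the new repository could advertise itself through the standard java.util.ServiceLoader mechanism. The sketch below uses simplified stand-ins for Pravega's actual StorageFactory and StorageFactoryCreator interfaces (the real signatures differ), and the NFS class names are hypothetical.

// Simplified stand-ins for Pravega's actual factory interfaces.
interface StorageFactory { }

interface StorageFactoryCreator {
    String getName();               // binding name, e.g. "FILESYSTEM"
    StorageFactory createFactory(); // builds the factory for this binding
}

// Hypothetical NFS binding living in the separate bindings repository.
// It is advertised to java.util.ServiceLoader through a provider file,
// e.g. META-INF/services/<fully.qualified.StorageFactoryCreator>,
// containing the single line <fully.qualified.NfsStorageFactoryCreator>.
final class NfsStorageFactory implements StorageFactory { }

final class NfsStorageFactoryCreator implements StorageFactoryCreator {
    @Override
    public String getName() {
        return "FILESYSTEM"; // matched against the storageImplementation config key
    }

    @Override
    public StorageFactory createFactory() {
        return new NfsStorageFactory();
    }
}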
Step 2: Re-implement the bindings to load the class libraries:
- NFS
- HDFS
- Extended S3
Step 3: Re-implement the segment store to load the class libraries for:
- ChunkManager
- Metadata table store
- Chunk storage provider
Earlier these used to correspond to:
- No-op storage
- Asynchronous storage
- Rolling storage
Use ChunkStorageProvider not just in a simulated environment but also in actual deployments corresponding to no-op, asynchronous, and rolling storage; a sketch of such a chunk-level contract follows.
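To make the chunk-level contract concrete, here is a minimal, hypothetical sketch of what such a provider could look like; the actual ChunkStorageProvider in the Pravega source is richer and has different signatures.

import java.io.InputStream;

// Hypothetical chunk-level storage contract; illustrative only.
interface ChunkHandle {
    String getChunkName();
}

interface SimpleChunkStorage {
    ChunkHandle create(String chunkName) throws Exception;       // create a new chunk
    int write(ChunkHandle handle, long offset, int length,
              InputStream data) throws Exception;                // append bytes
    int read(ChunkHandle handle, long offset, int length,
             byte[] buffer, int bufferOffset) throws Exception;  // read bytes
    void delete(ChunkHandle handle) throws Exception;            // remove the chunk
}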
API Changes
- There are no user-level API changes to ChunkStorageProvider.
- There are no changes to how tier-2 is configured.
API Testing
- All usages of ChunkStorageProvider from the segment store corresponding to no-op, async, and rolling storage will be exercised.
- Both the standalone and distributed deployment modes will be used. In-memory cluster testing is not part of the feature parity.
- Testing will include byte-level parity between segments in a stream on one storage adapter versus another; a minimal parity check sketch follows.
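As an illustration, byte-level parity between two adapters can be verified by streaming the same segment from each and comparing contents byte by byte; how the two InputStreams are obtained from the adapters is left open here.

import java.io.IOException;
import java.io.InputStream;

final class ParityCheck {
    // Returns true when both streams contain exactly the same bytes.
    static boolean sameBytes(InputStream a, InputStream b) throws IOException {
        int x;
        int y;
        do {
            x = a.read();
            y = b.read();
            if (x != y) {
                return false; // mismatch, or streams have different lengths
            }
        } while (x != -1);    // both streams ended together
        return true;
    }
}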
Internal Changes
Architecture Change
Please refer to PDP-34 for the changes to the internals of the binding; this section instead focuses on the changes in the usages of the binding. This PDP could be implemented before or after PDP-34 and is not impacted either way. Earlier, there were no changes to the segment store host, which used to persist the stream to files, blobs, or HDFS. With this change, the segment store maintains the same providers but no longer requires them at compile time. The interface used by the segment store server host is SyncStorage, and the bindings are responsible for implementing that interface. There is a clear separation between the binding users and SyncStorage.
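For context, a heavily simplified sketch of a synchronous, segment-level storage contract in the spirit of SyncStorage might look as follows; the real interface in the Pravega source has more operations and different signatures.

// Heavily simplified, illustrative segment-level storage contract.
interface SegmentHandle {
    String getSegmentName();
}

interface SimpleSyncStorage {
    SegmentHandle create(String segmentName);                    // create a segment
    void write(SegmentHandle handle, long offset,
               byte[] data, int length);                         // write at offset
    int read(SegmentHandle handle, long offset, byte[] buffer,
             int bufferOffset, int length);                      // read into buffer
    void seal(SegmentHandle handle);                             // make read-only
    void delete(SegmentHandle handle);                           // remove the segment
}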
Internal API changes
- Tier-2 storage providers
  - Operate only at the chunk level.
  - This works for file, blob, and HDFS storage in no-op, asynchronous, and rolling modes.
Internal Functionality changes
- Storage-related metadata
  - All storage-related metadata is stored in table segments.
  - Ability to import and export this data in a backward-compatible format (a sketch follows this list).
- Consolidation of functionality
  - The SegmentStore will not be aware of the internals of the bindings.
  - Bindings will be loaded at runtime and operated via a unified interface.
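A common way to keep such an export backward compatible is to prefix the serialized snapshot with a version number, so that newer code can still read older snapshots. The format below is purely illustrative, not the actual table-segment serialization.

import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

final class MetadataSnapshot {
    private static final int CURRENT_VERSION = 1;

    // Write storage metadata as a versioned key/value snapshot.
    static void exportTo(Map<String, String> metadata, DataOutputStream out) throws IOException {
        out.writeInt(CURRENT_VERSION);  // version first, for compatibility checks
        out.writeInt(metadata.size());
        for (Map.Entry<String, String> e : metadata.entrySet()) {
            out.writeUTF(e.getKey());
            out.writeUTF(e.getValue());
        }
    }

    // Read a snapshot, accepting any version up to the current one.
    static Map<String, String> importFrom(DataInputStream in) throws IOException {
        int version = in.readInt();
        if (version > CURRENT_VERSION) {
            throw new IOException("Snapshot version " + version + " is not supported");
        }
        Map<String, String> metadata = new HashMap<>();
        int count = in.readInt();
        for (int i = 0; i < count; i++) {
            metadata.put(in.readUTF(), in.readUTF());
        }
        return metadata;
    }
}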
Key Concepts
SyncStorage
- This is the interface used to refer to NFS, extended S3, or HDFS storage.
- RollingStorage also implements this interface, but it is a pseudo-storage because it layers above one of the three storages mentioned.
- The choice of SyncStorage is resolved by StorageLoader, which looks up the StorageFactory by the name specified in the ServiceConfig properties under the key storageImplementation. Each StorageFactory implementation depends on the configured storage adapter. The call sequence looks like this:
SegmentStore            StorageLoader   ServiceLoader   StorageFactoryCreator   StorageFactory
(HostServiceStarter)
    |  configSetup()         |               |                  |                    |
    |----------------------->|               |                  |                    |
    |                        |    load()     |                  |                    |
    |                        |-------------->|                  |                    |
    |                        |<--------------| loader           |                    |
    |                        |          createFactory()         |                    |
    |                        |--------------------------------->|                    |
    |                        |<---------------------------------| factory            |
    |                              attachStorage()                                   |
    |------------------------------------------------------------------------------->|
    |<-------------------------------------------------------------------------------|
- This implementation remains the same; only the NoopStorageProvider is retained in the Pravega source. A concrete resolution sketch follows.
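To make the sequence concrete, the sketch below shows how a loader could resolve the configured binding by name with java.util.ServiceLoader, reusing the simplified StorageFactoryCreator interface from the Step 1 sketch above. It also illustrates the point made under Key Concerns: a missing or misnamed binding jar can only surface as a load failure.

import java.util.ServiceLoader;

final class SimpleStorageLoader {
    // Resolves the factory whose creator matches the configured
    // storageImplementation value, mirroring load() -> createFactory().
    static StorageFactory load(String storageImplementation) {
        for (StorageFactoryCreator creator : ServiceLoader.load(StorageFactoryCreator.class)) {
            if (creator.getName().equalsIgnoreCase(storageImplementation)) {
                return creator.createFactory();
            }
        }
        // Fail fast with a descriptive message; this is the only
        // validation available once the bindings are separated.
        throw new IllegalStateException("No storage binding found for '"
                + storageImplementation + "'; check that the binding jar is on the classpath.");
    }
}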
Test refactoring
- The bulk of the changes in this code is the refactoring required in the tests.
- The adapters used by integration tests include SegmentStoreAdapter, AppendProcessorAdapter, and InProcessListenerWithRealStoreAdapter. These will move to the bindings repository.
- Although the adapters are written independently by the tests, they will need to be included differently in the test project, which makes pruning the integration tests less clean than pruning the main source.
- The standalone case also needs rework. Even though the standalone case uses the local filesystem and hosts the stream store in memory, the current code relies directly on the file binding, and this will have to change.
Key Concerns
Bootstrap
- Startup involves configuration, which has to be passed to the StorageLoader. With the bindings separated, there is no specific way to validate this configuration: a StorageFactoryCreator is found only if its name matches, so no validation is currently possible other than observing the load failure.
- The bindings continue to be a service. They load independently of the main source, and that load may fail. The logs will indicate if Pravega is unable to start up; this can be rectified immediately by including the correct jars with Pravega.
Compatibility
Byte level parity
- List all segments in a stream. Ensure that byte-level parity exists before and after the separation.
- Change adapters and repeat the above.
Backup/Restore
- Import segment metadata previously exported to a file.
Key Questions Answered
- Will the bindings work if tier-1 is changed to something other than BookKeeper? Yes, that is not a problem; this impacts only tier-2.
- Do the Pravega source and the bindings rely on each other? No, the bindings will compile and build independently.
Callouts
API contracts
The following classes and contracts will move to the bindings repository.
- io.pravega.segmentstore.storage.Storage;
- io.pravega.segmentstore.storage.rolling.RollingSegmentHandle;
- io.pravega.segmentstore.contracts.StorageNotPrimaryException;
Assumptions
Tier-2 is a layer below the stream store and does not have to be included except for persistence. Pravega can run entirely in memory.
Priorities and Tradeoffs
- Prefer repository refactoring rather than code refactoring
- Load bindings at runtime
- The tradeoff of this separation is the maintainability of the code, its servicing, and the building of the other automations.
Pros and Cons
The upside:
- Support for wider variety of Tier 2 implementations
- Less metadata in Tier 2 (no more Header files)
- Bindings separated for test versus production deployments
- Configurable as usual
- Standalone usage of the bindings continues
The downside:
- Resolving the storage provider at run time loses some integration value
- Need to add different integration tests.
Top GitHub Comments
As NFS tier-2 basically amounts to using the FILESYSTEM binding, would it be an option to at least keep it in the main repository? It uses only JRE libraries and does not import any additional library. This way, building Pravega results in a fully working package with a persistent tier-2, which is very useful for testing integrated products and for demos.
There is another aspect that needs to be taken into account: currently the docker image contains the logic that configures one of the supported storage bindings, and with the bindings now being a runtime dependency, there needs to be some other way to process them at startup.