Separate bindings from Pravega
Problem description
This issue tracks the initiative to separate the bindings from the Pravega source.
Problem location
The bindings source will be moved from project('bindings') to its own repository.
Suggestions for an improvement
A draft of the corresponding PDP is included below.
Associated code changes: https://github.com/pravega/pravega-bindings/pull/1 https://github.com/pravega/pravega/pull/4908
Status: Draft
Summary
This PDP proposes that the S3, HDFS, and file-system bindings no longer be built with Pravega, because they incur significant technical debt. Once implemented, this proposal will allow the bindings to be discovered at runtime and used independently. The project('bindings') entry will be removed from build.gradle, and the bindings will live in their own repository.
Shortcomings of the current implementation
- External dependency. The current Bindings implementation is at the mercy of all defects in external service providers such as HDFS, extended S3, and file-system implementations. These providers frequently change the methods and properties in their APIs, causing breaking changes in Pravega, even though the bindings are not always required for the stream store to run.
- Incorrect usage. The bindings are usually loaded to serve the standalone deployment mode, but they are subject to misuse. For example, the HDFS binding can be used to launch a distributed file system cluster over ephemeral storage within a single node; such deployments should never reach production.
Key Design Changes
Proposal
Below is the summary of this proposal.
Step 1: Implement the bindings in a separate repository.
- The bindings project will be completely removed from the Pravega source. It provides the implementations of SyncStorage, which will be resolved dynamically (a registration sketch follows this step).
- It will be hosted in a separate repository where it can be serviced more frequently, as and when the external dependencies provide updates.
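For illustration, a binding in the new repository could advertise itself through the standard java.util.ServiceLoader mechanism. The sketch below uses simplified stand-ins for Pravega's actual StorageFactory and StorageFactoryCreator interfaces (the real signatures differ), and the NFS class names are hypothetical.

// Simplified stand-ins for Pravega's actual factory interfaces.
interface StorageFactory { }

interface StorageFactoryCreator {
    String getName();               // binding name, e.g. "FILESYSTEM"
    StorageFactory createFactory(); // builds the factory for this binding
}

// Hypothetical NFS binding living in the separate bindings repository.
// It is advertised to java.util.ServiceLoader through a provider file,
// e.g. META-INF/services/<fully.qualified.StorageFactoryCreator>,
// containing the single line <fully.qualified.NfsStorageFactoryCreator>.
final class NfsStorageFactory implements StorageFactory { }

final class NfsStorageFactoryCreator implements StorageFactoryCreator {
    @Override
    public String getName() {
        return "FILESYSTEM"; // matched against the storageImplementation config key
    }

    @Override
    public StorageFactory createFactory() {
        return new NfsStorageFactory();
    }
}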
Step 2: Re-implement the bindings to load the class libraries:
- NFS
- HDFS
- Extended S3
Step 3: Re-implement the segment store to load the class libraries for:
- ChunkManager
- Metadata table store
- Chunk storage provider
Earlier these used to correspond to:
- No-op storage
- Asynchronous storage
- Rolling storage
Use ChunkStorageProvider not just in a simulated environment but also in actual deployments corresponding to no-op, asynchronous, and rolling storage; a sketch of such a chunk-level contract follows.
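To make the chunk-level contract concrete, here is a minimal, hypothetical sketch of what such a provider could look like; the actual ChunkStorageProvider in the Pravega source is richer and has different signatures.

import java.io.InputStream;

// Hypothetical chunk-level storage contract; illustrative only.
interface ChunkHandle {
    String getChunkName();
}

interface SimpleChunkStorage {
    ChunkHandle create(String chunkName) throws Exception;       // create a new chunk
    int write(ChunkHandle handle, long offset, int length,
              InputStream data) throws Exception;                // append bytes
    int read(ChunkHandle handle, long offset, int length,
             byte[] buffer, int bufferOffset) throws Exception;  // read bytes
    void delete(ChunkHandle handle) throws Exception;            // remove the chunk
}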
API Changes
- There are no user-level API changes to ChunkStorageProvider.
- There are no changes to how tier-2 is configured.
API Testing
- All usages of ChunkStorageProvider from the segment store corresponding to no-op, async, and rolling storage will be exercised.
- Both the standalone and distributed deployment modes will be used. In-memory cluster testing is not part of the feature parity.
- Testing will include byte-level parity between segments in a stream on one storage adapter versus another; a minimal parity check sketch follows.
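As an illustration, byte-level parity between two adapters can be verified by streaming the same segment from each and comparing contents byte by byte; how the two InputStreams are obtained from the adapters is left open here.

import java.io.IOException;
import java.io.InputStream;

final class ParityCheck {
    // Returns true when both streams contain exactly the same bytes.
    static boolean sameBytes(InputStream a, InputStream b) throws IOException {
        int x;
        int y;
        do {
            x = a.read();
            y = b.read();
            if (x != y) {
                return false; // mismatch, or streams have different lengths
            }
        } while (x != -1);    // both streams ended together
        return true;
    }
}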
Internal Changes
Architecture Change
Please refer to PDP-34 for the changes to the internals of the binding; this section instead focuses on the changes in the usages of the binding. This PDP could be implemented before or after PDP-34 and is not impacted either way. Earlier, there were no changes to the segment store host, which used to persist the stream to files, blobs, or HDFS. With this change, the segment store maintains the same providers but no longer requires them at compile time. The interface used by the segment store server host is SyncStorage, and the bindings are responsible for implementing that interface. There is a clear separation between the binding users and SyncStorage.
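For context, a heavily simplified sketch of a synchronous, segment-level storage contract in the spirit of SyncStorage might look as follows; the real interface in the Pravega source has more operations and different signatures.

// Heavily simplified, illustrative segment-level storage contract.
interface SegmentHandle {
    String getSegmentName();
}

interface SimpleSyncStorage {
    SegmentHandle create(String segmentName);                    // create a segment
    void write(SegmentHandle handle, long offset,
               byte[] data, int length);                         // write at offset
    int read(SegmentHandle handle, long offset, byte[] buffer,
             int bufferOffset, int length);                      // read into buffer
    void seal(SegmentHandle handle);                             // make read-only
    void delete(SegmentHandle handle);                           // remove the segment
}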
Internal API changes
- Tier-2 storage providers
  - Operate only at the chunk level.
  - This works for file, blob, and HDFS storage in no-op, asynchronous, and rolling modes.
Internal Functionality changes
- Storage-related metadata
  - All storage-related metadata is stored in table segments.
  - Ability to import and export this data in a backward-compatible format (a sketch follows this list).
- Consolidation of functionality
  - The SegmentStore will not be aware of the internals of the bindings.
  - Bindings will be loaded at runtime and operated via a unified interface.
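A common way to keep such an export backward compatible is to prefix the serialized snapshot with a version number, so that newer code can still read older snapshots. The format below is purely illustrative, not the actual table-segment serialization.

import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

final class MetadataSnapshot {
    private static final int CURRENT_VERSION = 1;

    // Write storage metadata as a versioned key/value snapshot.
    static void exportTo(Map<String, String> metadata, DataOutputStream out) throws IOException {
        out.writeInt(CURRENT_VERSION);  // version first, for compatibility checks
        out.writeInt(metadata.size());
        for (Map.Entry<String, String> e : metadata.entrySet()) {
            out.writeUTF(e.getKey());
            out.writeUTF(e.getValue());
        }
    }

    // Read a snapshot, accepting any version up to the current one.
    static Map<String, String> importFrom(DataInputStream in) throws IOException {
        int version = in.readInt();
        if (version > CURRENT_VERSION) {
            throw new IOException("Snapshot version " + version + " is not supported");
        }
        Map<String, String> metadata = new HashMap<>();
        int count = in.readInt();
        for (int i = 0; i < count; i++) {
            metadata.put(in.readUTF(), in.readUTF());
        }
        return metadata;
    }
}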
Key Concepts
SyncStorage
- This is the interface used to refer to NFS, extended S3, or HDFS storage.
- RollingStorage also implements this interface, but it is a pseudo-storage because it layers above one of the three storages mentioned.
- The choice of SyncStorage is resolved by StorageLoader, which looks up the StorageFactory by the name specified in the ServiceConfig properties under the key storageImplementation. Each StorageFactory implementation depends on the configured storage adapter. The call sequence looks like this:
SegmentStore            StorageLoader   ServiceLoader   StorageFactoryCreator   StorageFactory
(HostServiceStarter)
    |  configSetup()         |               |                  |                    |
    |----------------------->|               |                  |                    |
    |                        |    load()     |                  |                    |
    |                        |-------------->|                  |                    |
    |                        |<--------------| loader           |                    |
    |                        |          createFactory()         |                    |
    |                        |--------------------------------->|                    |
    |                        |<---------------------------------| factory            |
    |                              attachStorage()                                   |
    |------------------------------------------------------------------------------->|
    |<-------------------------------------------------------------------------------|
- This implementation remains the same; only the NoopStorageProvider is retained in the Pravega source. A concrete resolution sketch follows.
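To make the sequence concrete, the sketch below shows how a loader could resolve the configured binding by name with java.util.ServiceLoader, reusing the simplified StorageFactoryCreator interface from the Step 1 sketch above. It also illustrates the point made under Key Concerns: a missing or misnamed binding jar can only surface as a load failure.

import java.util.ServiceLoader;

final class SimpleStorageLoader {
    // Resolves the factory whose creator matches the configured
    // storageImplementation value, mirroring load() -> createFactory().
    static StorageFactory load(String storageImplementation) {
        for (StorageFactoryCreator creator : ServiceLoader.load(StorageFactoryCreator.class)) {
            if (creator.getName().equalsIgnoreCase(storageImplementation)) {
                return creator.createFactory();
            }
        }
        // Fail fast with a descriptive message; this is the only
        // validation available once the bindings are separated.
        throw new IllegalStateException("No storage binding found for '"
                + storageImplementation + "'; check that the binding jar is on the classpath.");
    }
}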
Test refactoring
- The bulk of the changes in this code is the refactoring required in the tests.
- The adapters used by integration tests include SegmentStoreAdapter, AppendProcessorAdapter, and InProcessListenerWithRealStoreAdapter. These will move to the bindings repository.
- Although the adapters are written independently by the tests, they will need to be included differently in the test project, which makes pruning the integration tests less clean than pruning the main source.
- The standalone case also needs rework. Even though the standalone case uses the local filesystem and hosts the stream store in memory, the current code relies directly on the file binding, and this will have to change.
Key Concerns
Bootstrap
- Startup involves configuration, which has to be passed to the StorageLoader. With the bindings separated, there is no specific way to validate this configuration: a StorageFactoryCreator is found only if its name matches, so no validation is currently possible other than observing the load failure.
- The bindings continue to be a service. They load independently of the main source, and that load may fail. The logs will indicate if Pravega is unable to start up; this can be rectified immediately by including the correct jars with Pravega.
Compatibility
Byte level parity
- List all segments in a stream. Ensure that byte-level parity exists before and after the separation.
- Change adapters and repeat the above.
Backup/Restore
- Import segment metadata previously exported to a file.
Key Questions Answered
- Will the bindings work if tier-1 is changed to something other than BookKeeper? Yes, that is not a problem; this impacts only tier-2.
- Do the Pravega source and the bindings rely on each other? No, the bindings will compile and build independently.
Callouts
API contracts
The following classes and contracts will move to the bindings repository.
- io.pravega.segmentstore.storage.Storage;
- io.pravega.segmentstore.storage.rolling.RollingSegmentHandle;
- io.pravega.segmentstore.contracts.StorageNotPrimaryException;
Assumptions
Tier-2 is a layer below the stream store and does not have to be included except for persistence. Pravega can run entirely in memory.
Priorities and Tradeoffs
- Prefer repository refactoring rather than code refactoring
- Load bindings at runtime
- The tradeoff of this separation is the maintainability of the code, its servicing, and the building of the other automations.
Pros and Cons
The upside:
- Support for wider variety of Tier 2 implementations
- Less metadata in Tier 2 (no more Header files)
- Bindings separated for test versus production deployments
- Configurable as usual
- Standalone usage of the bindings continues
The downside:
- Resolving the storage provider at run time loses some integration value
- Need to add different integration tests.
Top GitHub Comments
As NFS tier-2 basically amounts to using the FILESYSTEM binding, would it be an option to at least keep it in the main repository? It uses only JRE libraries and does not import any additional library. This way, building Pravega results in a fully working package with a persistent tier-2, which is very useful for testing integrated products and for demos.
There is another aspect that needs to be taken into account: currently the docker image contains the logic that configures one of the supported storage bindings, and with the bindings now being a runtime dependency, there needs to be some other way to process them at startup.