Separate bindings from Pravega

Problem description: This issue tracks the initiative to separate the bindings from the Pravega source.

Problem location: The source will be moved from project('bindings') to its own repository.

Suggestions for an improvement: A draft of the corresponding PDP is included below.

Associated code changes: https://github.com/pravega/pravega-bindings/pull/1 https://github.com/pravega/pravega/pull/4908

Status: Draft

Summary

This PDP discusses a change to no longer require the S3, HDFS and file-system bindings to be built with Pravega, because they incur a lot of technical debt. Once implemented, this proposal will allow the bindings to be discovered at runtime and used independently. The project(bindings) will be removed from build.gradle and will appear in its own repository.

Shortcomings of the current implementation

  1. External dependency. The current implementation of the bindings is at the mercy of all defects from external service providers such as HDFS, extended S3 and file-system implementors. They frequently change the methods and properties in their APIs and cause breaking changes in Pravega, while the bindings are not always required for the execution of the stream store.
  2. Incorrect usages. Usually the bindings are loaded to serve the standalone mode of deployment. However, they are subject to misuse. For example, the HDFS binding is used to launch a distributed file system cluster over ephemeral storage, all within a single node. Such deployments should not be used in production.

Key Design Changes

Proposal

Below is the summary of this proposal.

Step 1: Implement the bindings in a separate repository.

  • The bindings project will be completely removed from the Pravega source. It provides implementations of SyncStorage, which will be resolved dynamically (a sketch of such a binding follows this list).
  • It will be hosted in a separate repository where it can be serviced more frequently as and when the external dependencies provide updates.
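
To make the runtime discovery concrete, here is a minimal sketch of what a binding living in the separate bindings repository could look like. SyncStorage and StorageFactoryCreator are the contracts named in this proposal, but the single-method shapes below are simplifying assumptions for illustration, not the real Pravega signatures, and the class name is hypothetical.

// Stand-ins for the contracts that stay in the Pravega source; simplified for illustration.
interface SyncStorage { }

interface StorageFactoryCreator {
    String getName();              // matched against the configured storage name
    SyncStorage createStorage();   // simplified shape, assumption only
}

// Hypothetical HDFS binding hosted in the separate repository. For runtime
// discovery it would also ship a provider-configuration file, e.g.
// META-INF/services/<fully.qualified.StorageFactoryCreator>, listing this
// class so java.util.ServiceLoader can find it without any compile-time link.
public class HdfsBindingSketch implements StorageFactoryCreator {
    @Override
    public String getName() {
        return "HDFS";
    }

    @Override
    public SyncStorage createStorage() {
        return new SyncStorage() { };  // a real binding would wrap the HDFS client here
    }
}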

Step 2: Re-implement the bindings to load the class libraries:

  • NFS
  • HDFS
  • Extended S3

Step 3: Re-implement segment store to load the class libraries:

  • ChunkManager
  • Metadata table store, and
  • Chunk storage provider

Earlier these used to correspond to:

  • No-op storage
  • Asynchronous storage
  • Rolling storage

Use ChunkStorageProvider not just in a simulated environment but also in the actual deployment corresponding to no-op, asynchronous and rolling storage.

API Changes

  • There are no user level API changes to ChunkStorageProvider.
  • There are no changes to how tier-2 is configured.

API Testing

  • All usages of ChunkStorageProvider from the segment store corresponding to no-op, asynchronous and rolling storage will be exercised.
  • Both the standalone and distributed modes of deployment will be used. In-memory cluster testing is not part of the feature parity.
  • Testing will include byte-level parity between segments in a stream on one storage adapter versus another (a sketch of such a check follows this list).
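
As an illustration of the byte-level parity check, here is a small self-contained sketch. The reader shape (a plain Function from segment name to bytes) and the segment names are hypothetical stand-ins for whichever adapters the tests end up using; it only shows the comparison itself.

import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

public class ByteParityCheckSketch {

    // Returns true only if every named segment has identical bytes on both adapters.
    static boolean segmentsMatch(List<String> segmentNames,
                                 Function<String, byte[]> adapterA,
                                 Function<String, byte[]> adapterB) {
        return segmentNames.stream()
                .allMatch(name -> Arrays.equals(adapterA.apply(name), adapterB.apply(name)));
    }

    public static void main(String[] args) {
        // Toy stand-ins for the "before separation" and "after separation" adapters.
        Function<String, byte[]> before = name -> name.getBytes();
        Function<String, byte[]> after = name -> name.getBytes();
        System.out.println(segmentsMatch(
                List.of("scope/stream/segment-0", "scope/stream/segment-1"), before, after));
    }
}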

Internal Changes

Architecture Change

Please refer to PDP-34 for the changes to the internals of the bindings; this PDP instead focuses on the changes in how the bindings are used. This PDP could be implemented before or after PDP-34 and is not impacted either way. Earlier, there were no changes to the host in the segment store, which persists the stream to files, blobs or HDFS. With this proposal, the segment store will maintain the same providers but without requiring them at compile time. The interface used by the segment store server host is SyncStorage, and the bindings are responsible for implementing that interface. There is a clear separation between users of the bindings and SyncStorage.

Internal API changes

  • Tier-2 storage providers:
      • Operate only at the chunk level (a minimal sketch of a chunk-level contract follows this list).
      • This works for file, blob and HDFS storage in no-op, asynchronous and rolling mode.
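
To illustrate what "operate only at the chunk level" could mean in code, here is a minimal, hypothetical chunk-level contract. It is only a sketch of the idea; the actual ChunkStorageProvider API from PDP-34 differs, and every method shape here is an assumption.

import java.io.InputStream;

// Hypothetical minimal chunk-level contract: a tier-2 binding only deals in
// named chunks. Rolling, no-op and asynchronous behavior stays in the
// segment store, layered above this interface.
interface ChunkLevelStorage {
    void create(String chunkName);
    int write(String chunkName, long offset, int length, InputStream data);
    int read(String chunkName, long fromOffset, byte[] buffer, int bufferOffset, int length);
    void concat(String targetChunk, String sourceChunk); // append source onto target
    void delete(String chunkName);
    boolean exists(String chunkName);
}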

Internal Functionality changes

  • Storage-related metadata:
      • All storage-related metadata is stored in table segments.
      • Ability to import and export this data in a backward-compatible manner.
  • Consolidation of functionality:
      • The SegmentStore will not be aware of the internals of the bindings.
      • Bindings will be loaded at runtime and operated via a unified interface.

Key Concepts

SyncStorage

  • This is the interface used to refer to NFS, extended S3 or HDFS storage.
  • RollingStorage also implements this interface, but it is a pseudo-storage because it is a layer above one of the three mentioned above.
  • The choice of SyncStorage is resolved by StorageLoader, which resolves the StorageFactory by the name specified in the ServiceConfig properties under the key storageImplementation. Each StorageFactory implementation depends on the storage adapter configured. The call sequence looks like this:
SegmentStoreHostServiceStarter    StorageLoader     ServiceLoader     StorageFactoryCreator     StorageFactory
        |  configSetup
        |  -------------->  |
                            |  -------------->  load()
                            |  <--------------  loader
                            |  ------------------------------------>  createFactory()
                            |  <------------------------------------  factory
                            |  ---------------------------------------------------------------->  attachStorage()
                            |  <----------------------------------------------------------------
  • This implementation remains the same; only the NoopStorageProvider is retained. A sketch of the loader-side resolution follows.
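
Here is a minimal, self-contained sketch of the loader-side resolution that the sequence above describes: discover every StorageFactoryCreator on the classpath via java.util.ServiceLoader, match it against the configured storageImplementation name, and create the factory. The interface shapes and the exception are simplifying assumptions for this sketch, not Pravega's actual StorageLoader code.

import java.util.ServiceLoader;

// Stand-ins for the participants in the sequence diagram above; every method
// shape here is an assumption made for this sketch.
interface StorageFactory { Object createStorageAdapter(); }

interface StorageFactoryCreator {
    String getName();
    StorageFactory createFactory();
}

public class StorageLoaderSketch {

    // Mirrors the sequence: load() -> match by name -> createFactory().
    static StorageFactory resolve(String storageImplementation) {
        for (StorageFactoryCreator creator : ServiceLoader.load(StorageFactoryCreator.class)) {
            if (creator.getName().equalsIgnoreCase(storageImplementation)) {
                return creator.createFactory();
            }
        }
        // With the bindings outside the build, a missing jar only surfaces here.
        throw new IllegalStateException(
                "No storage binding named '" + storageImplementation + "' was found on the classpath");
    }

    public static void main(String[] args) {
        try {
            StorageFactory factory = resolve("HDFS");  // value of the storageImplementation property
            System.out.println("Attached storage from: " + factory);
        } catch (IllegalStateException e) {
            System.err.println(e.getMessage());
        }
    }
}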

Test refactoring

  • The bulk of the changes in this code is the refactoring required in the tests.
  • The adapters used by integration tests include SegmentStoreAdapter, AppendProcessorAdapter and InProcessListenerWithRealStoreAdapter. These will move to the bindings repository.
  • Although the adapters are written independently by the tests, they will need to be included differently in the test project, which makes pruning the integration tests less clean than pruning the main source.
  • The standalone case also needs rework. Even though the local file system is used in the standalone case and the stream store is hosted in memory, the current code relies on the file binding, and this will have to change.

Key Concerns

Bootstrap

  1. Startup involves configuration, and this has to be passed to the StorageLoader. With the separation of the bindings, there is no specific way to validate it. A StorageFactoryCreator is found only if the name matches; currently, no validation is possible other than the load failure (a sketch of a startup check follows this list).
  2. The bindings continue to be a service. Their load is independent of the main source and may result in failures. The logs will indicate if Pravega is not able to start up. This can be immediately rectified by including the correct jars with Pravega.
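
Since the only built-in validation is the load failure itself, a deployment could add a startup check along the following lines: enumerate every discoverable binding name and report them when the configured name is missing. This is a hypothetical diagnostic, not part of the proposal, and the StorageFactoryCreator shape is again a simplified stand-in.

import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

interface StorageFactoryCreator { String getName(); }  // simplified stand-in for this sketch

public class BindingStartupCheck {
    public static void main(String[] args) {
        String requested = "EXTENDEDS3";   // hypothetical configured storageImplementation value
        List<String> available = new ArrayList<>();
        ServiceLoader.load(StorageFactoryCreator.class)
                .forEach(creator -> available.add(creator.getName()));

        if (!available.contains(requested)) {
            // Turns a bare load failure into an actionable message.
            System.err.println("storageImplementation=" + requested
                    + " but only these bindings are on the classpath: " + available
                    + ". Include the matching binding jar and restart.");
        }
    }
}
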
Compatibility

Byte level parity

  • List all segments in a stream. Ensure that byte-level parity exists before and after the separation.
  • Change adapters and repeat the above.

Backup/Restore

  • Import segment metadata previously exported to a file.

Key Questions Answered

  1. Will the bindings work if tier-1 is changed to something other than BookKeeper? Yes, that is not a problem; this impacts only Tier 2.
  2. Do the Pravega source and the bindings rely on each other? No, the bindings will compile and build independently.

Callouts

API contracts

The following classes and contracts will move to the bindings repository.

  • io.pravega.segmentstore.storage.Storage;
  • io.pravega.segmentstore.storage.rolling.RollingSegmentHandle;
  • io.pravega.segmentstore.contracts.StorageNotPrimaryException;

Assumptions

Tier 2 is a layer below the stream store and does not have to be included other than for persistence. Pravega can be entirely in memory.

Priorities and Tradeoffs

  1. Prefer repository refactoring rather than code refactoring.
  2. Load bindings at runtime.
  3. The tradeoff of this separation is the maintainability of the code, the servicing, and building the other automations.

Pros and Cons

The upside:

  • Support for a wider variety of Tier 2 implementations
  • Less metadata in Tier 2 (no more header files)
  • Bindings separated for test versus production deployments
  • Configurable as usual
  • Standalone usage of the bindings continues

The downside:

  • Resolving the storage provider at runtime loses integration value
  • Need to add different integration tests.

Top GitHub Comments

eolivelli commented on Jun 29, 2020 (2 reactions)

As the NFS tier-2 is basically about using the FILESYSTEM, would it be an option to at least keep it in the main repository? It uses only JRE libraries; it does not import any additional library. This way, building Pravega results in a fully working package, with a persistent tier-2 that is very useful for testing integrated products and for demos.

medvedevigorek commented on Jun 29, 2020 (1 reaction)

There is another aspect that needs to be taken into account: currently the Docker image has the logic that configures one of the supported storage bindings, and with the bindings becoming a runtime dependency there should now be some other way to process them at startup.
