Stateless Ingestion causes previous data to be overridden.

See original GitHub issue

Describe the bug

I am using this example to tag columns in a table. One issue I noticed is the graph.get_aspect_v2 part: you always have to make a GET request to the server first to obtain all existing tags, append the new tag if it is not already present, and then emit the result to DataHub.

I find it a little odd that the client side has to know all the existing tags while the server side is completely stateless. I tried to bypass fetching the aspect and simply constructed a MetadataChangeProposalWrapper with GlobalTagsClass(tags=[tag_association_to_add]) regardless of the current state. I noticed that this removes all the other tags. I was expecting it to append only the tag I am adding, not remove the others.

Is this intended by design? Is there a way to change this, with a flag or some other way of submitting? One big issue here is the race condition: if I submit these changes through Kafka events (or even synchronously in parallel) and there happen to be multiple MCPWs for the same column, other tags could be lost.

To Reproduce

An attempt to make a stateless metadata change would look like this:

    public static MetadataChangeProposalWrapper createTagChange(String assetUrn, String column, String tagUrn) throws URISyntaxException {
        TagAssociation tagAssociation = new TagAssociation().setTag(TagUrn.createFromString(tagUrn));
        GlobalTags globalTags = new GlobalTags().setTags(new TagAssociationArray(tagAssociation));
        EditableSchemaFieldInfo editableSchemaFieldInfo = new EditableSchemaFieldInfo().setFieldPath(column).setGlobalTags(globalTags);
        EditableSchemaMetadata editableSchemaMetadata = new EditableSchemaMetadata()
                .setEditableSchemaFieldInfo(new EditableSchemaFieldInfoArray(editableSchemaFieldInfo))
                .setCreated(createCurrentAuditStamp());
        return MetadataChangeProposalWrapper.builder()
                .entityType("dataset").entityUrn(assetUrn).upsert().aspect(editableSchemaMetadata).build();
    }

Then just emitting this with new KafkaEmitter(config).emit(createTagChange(urn,column,myTag)) would result in the existing tags being removed.

Expected behavior

In the above example, if I have 3 tags associated with a column and then add a 4th, myTag, I would expect to end up with 4 tags including myTag. Instead, the existing behavior removes the 3 tags and leaves only myTag.

Issue Analytics

  • State: closed
  • Created a year ago
  • Reactions: 1
  • Comments: 8 (1 by maintainers)

Top GitHub Comments

2 reactions
RyanHolstien commented, Aug 15, 2022

This is current expected behavior. All MetadataChangeProposals currently come across with ChangeType.UPSERT indicating that it is intended to be a full replacement operation. We are actively working on PATCH changetype semantics and will be rolling it out to different aspects once the behavior is supported.

The current way to get around this is to do a Read -> Modify -> Write
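
The difference between a bare UPSERT and Read -> Modify -> Write can be sketched with an in-memory stand-in for the server. Everything in this sketch (the `AspectStore` class, its `get`, `upsert` and `addTag` methods, the field name and the tag URNs) is hypothetical and only models the full-replacement behavior described in this comment; it is not the DataHub SDK.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory stand-in for the metadata server; not the DataHub API.
class AspectStore {
    private final Map<String, List<String>> tagsByField = new HashMap<>();

    // Read: returns a copy of the tags currently stored for a field.
    List<String> get(String field) {
        return new ArrayList<>(tagsByField.getOrDefault(field, List.of()));
    }

    // UPSERT modeled as a full replacement of the stored tag list.
    void upsert(String field, List<String> tags) {
        tagsByField.put(field, new ArrayList<>(tags));
    }

    // Read -> Modify -> Write: fetch current tags, append if missing, write everything back.
    void addTag(String field, String tag) {
        List<String> current = get(field);   // Read
        if (!current.contains(tag)) {
            current.add(tag);                // Modify
        }
        upsert(field, current);              // Write
    }
}

class ReadModifyWriteDemo {
    public static void main(String[] args) {
        AspectStore store = new AspectStore();
        List<String> initial = List.of("urn:li:tag:a", "urn:li:tag:b", "urn:li:tag:c");

        // A bare upsert carrying only the new tag drops the three existing tags.
        store.upsert("user_id", initial);
        store.upsert("user_id", List.of("urn:li:tag:myTag"));
        System.out.println(store.get("user_id").size()); // 1

        // Going through addTag keeps the existing tags and adds the new one.
        store.upsert("user_id", initial);
        store.addTag("user_id", "urn:li:tag:myTag");
        System.out.println(store.get("user_id").size()); // 4
    }
}
```

In a real client the `get` and `upsert` calls would be network requests, which is why this pattern is only safe when writers to the same aspect do not overlap.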

1 reaction
RyanHolstien commented, Aug 15, 2022

@RyanHolstien how can I do Modify and then Write? Looking through the docs I don’t really see anything about that.

Also could you please confirm what happens in this scenario: Given two metadata changes; m1 and m2 on the same resource + column but they are both adding different tags.

Times represented from t0-t5

t0: m1 reads the aspect
t1: m2 reads the aspect
t2: m1 modifies the metadata change proposal
t3: m1 writes the metadata change proposal
t4: m2 modifies
t5: m2 writes

In that case wouldn’t m2 still be writing the old aspect that was valid at t1?
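
The t0-t5 timeline can be replayed deterministically in a few lines. This is a minimal sketch in which the "server" is just a variable holding a tag list and every write is a full replacement; the class name, the method name and the tag values are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical replay of the t0-t5 interleaving; not DataHub code.
class LostUpdateDemo {
    static List<String> runTimeline() {
        List<String> stored = new ArrayList<>(List.of("existing"));

        List<String> m1 = new ArrayList<>(stored); // t0: m1 reads the aspect
        List<String> m2 = new ArrayList<>(stored); // t1: m2 reads the aspect
        m1.add("tag1");                            // t2: m1 modifies its copy
        stored = m1;                               // t3: m1 writes (full replacement)
        m2.add("tag2");                            // t4: m2 modifies its stale copy
        stored = m2;                               // t5: m2 writes, replacing m1's result
        return stored;
    }

    public static void main(String[] args) {
        System.out.println(runTimeline()); // prints [existing, tag2]
    }
}
```

In this model `tag1` is lost because m2's write is built from the state it read at t1.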

@sarpk By Read -> Modify -> Write, I mean within your application using the SDK to send MCPs you would perform a GET -> some code -> POST. As you pointed out, this does require the application to perform these operations synchronously as there is no locking in this scenario. With Patch, the operation will all be done in a single atomic DB transaction.
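
As a rough in-memory analogy for why an atomic operation avoids the lost update, the sketch below performs the read, the append and the write inside a single `ConcurrentHashMap.compute` call, which the JDK guarantees to be atomic per key. The `AtomicAppendStore` and `AtomicAppendDemo` names are hypothetical, and this is not how DataHub implements PATCH; it only illustrates the idea of moving the merge to the side that owns the state.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical in-memory analog of an atomic "add tag" operation; not DataHub's PATCH.
class AtomicAppendStore {
    private final Map<String, List<String>> tagsByField = new ConcurrentHashMap<>();

    // Read, append and write all happen inside one atomic compute() call.
    void addTag(String field, String tag) {
        tagsByField.compute(field, (key, current) -> {
            List<String> next = current == null ? new ArrayList<>() : new ArrayList<>(current);
            if (!next.contains(tag)) {
                next.add(tag);
            }
            return next;
        });
    }

    List<String> get(String field) {
        return new ArrayList<>(tagsByField.getOrDefault(field, List.of()));
    }
}

class AtomicAppendDemo {
    // Starts one thread per tag, all writing to the same field, and returns the final tags.
    static List<String> addConcurrently(int writers) {
        AtomicAppendStore store = new AtomicAppendStore();
        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < writers; i++) {
            String tag = "tag" + i;
            Thread t = new Thread(() -> store.addTag("user_id", tag));
            threads.add(t);
            t.start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return store.get("user_id");
    }

    public static void main(String[] args) {
        System.out.println(addConcurrently(8).size()); // 8
    }
}
```

Each writer only names the tag it wants to add, so no writer ever sends a stale copy of the full list.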

@HunterEl We don’t currently have it tracked on the OSS roadmap as the effort was requested by a customer and not the community, but keep an eye out for a PR coming in the next couple weeks or so for the initial work on the OSS side 😄
