Stateless Ingestion causes previous data to be overridden.

Describe the bug

I am using this example to tag columns in a table. One issue I noticed is the graph.get_aspect_v2 step: you always have to make a GET request to the server first to obtain all existing tags, append the new tag if it is not already present, and then emit the result to DataHub.

I find it a little odd that the client has to know all the existing tags while the server is completely stateless. I tried to skip fetching the aspect and simply constructed a MetadataChangeProposalWrapper with GlobalTagsClass(tags=[tag_association_to_add]) regardless of the current state. I noticed that this removes all the other tags. I expected it to append only the tag I am adding, not remove the others.

Is this intended by design? Is there a way to change this, with a flag or some other way of submitting? One big issue here is the race condition: if I submit these changes through Kafka events (or even synchronously in parallel) and there happen to be multiple MCPWs for the same column, other tags could be lost.

To Reproduce

An attempt to make a stateless metadata change would look like this:

    // Builds an UPSERT proposal whose aspect carries only the new tag for the given column.
    public static MetadataChangeProposalWrapper createTagChange(String assetUrn, String column, String tagUrn) throws URISyntaxException {
        TagAssociation tagAssociation = new TagAssociation().setTag(TagUrn.createFromString(tagUrn));
        // No existing tags are read first, so this GlobalTags lists just the one new tag.
        GlobalTags globalTags = new GlobalTags().setTags(new TagAssociationArray(tagAssociation));
        EditableSchemaFieldInfo editableSchemaFieldInfo = new EditableSchemaFieldInfo().setFieldPath(column).setGlobalTags(globalTags);
        EditableSchemaMetadata editableSchemaMetadata = new EditableSchemaMetadata()
                .setEditableSchemaFieldInfo(new EditableSchemaFieldInfoArray(editableSchemaFieldInfo))
                .setCreated(createCurrentAuditStamp());
        return MetadataChangeProposalWrapper.builder()
                .entityType("dataset").entityUrn(assetUrn).upsert().aspect(editableSchemaMetadata).build();
    }

Then just emitting this with new KafkaEmitter(config).emit(createTagChange(urn,column,myTag)) results in the existing tags being removed.

Expected behavior

In the above example, if I have 3 tags associated with a column and I add a 4th, myTag, I would expect to end up with 4 tags including myTag. Instead, the existing behavior removes the 3 tags and leaves only myTag.

Top GitHub Comments

RyanHolstien commented, Aug 15, 2022

This is the current expected behavior. All MetadataChangeProposals currently come across with ChangeType.UPSERT, indicating that the change is intended to be a full replacement operation. We are actively working on PATCH change type semantics and will be rolling it out to different aspects once the behavior is supported.

The current way to get around this is to do a Read -> Modify -> Write.
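
To make the difference concrete, here is a minimal sketch of full-replacement UPSERT versus Read -> Modify -> Write. It assumes a purely hypothetical in-memory store (UpsertSketch, with its read, upsert and addTag methods) standing in for the server; none of these names come from the DataHub SDK, and the tag URNs are made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical in-memory stand-in for the metadata store; not DataHub code.
final class UpsertSketch {
    // Field key (for example "dataset urn + column") -> tag URNs currently stored.
    private final Map<String, Set<String>> tagsByField = new HashMap<>();

    // Read: return a mutable copy of the stored tags (empty if none exist).
    Set<String> read(String field) {
        return new TreeSet<>(tagsByField.getOrDefault(field, Set.of()));
    }

    // UPSERT: the submitted set fully replaces whatever was stored before.
    void upsert(String field, Set<String> tags) {
        tagsByField.put(field, new TreeSet<>(tags));
    }

    // Read -> Modify -> Write: fetch the existing tags, add one, upsert the union.
    void addTag(String field, String tag) {
        Set<String> current = read(field); // Read
        current.add(tag);                  // Modify
        upsert(field, current);            // Write
    }

    public static void main(String[] args) {
        UpsertSketch store = new UpsertSketch();
        Set<String> initial = Set.of("urn:li:tag:a", "urn:li:tag:b", "urn:li:tag:c");

        // A blind upsert carrying only the new tag drops the three existing ones.
        store.upsert("col", initial);
        store.upsert("col", Set.of("urn:li:tag:myTag"));
        System.out.println("blind upsert:      " + store.read("col"));

        // Read -> Modify -> Write keeps the existing tags and adds the new one.
        store.upsert("col", initial);
        store.addTag("col", "urn:li:tag:myTag");
        System.out.println("read-modify-write: " + store.read("col"));
    }
}
```

In this model the first print shows only myTag, which mirrors the behavior reported in the issue, while the second shows all four tags. The sketch ignores concurrency on purpose; it only illustrates replacement semantics.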

RyanHolstien commented, Aug 15, 2022

@RyanHolstien how can I do Modify and then Write? Looking through the docs I don’t really see anything about that.

Also, could you please confirm what happens in this scenario: given two metadata changes, m1 and m2, on the same resource and column, each adding a different tag.

Times run from t0 to t5:

t0: m1 reads the aspect
t1: m2 reads the aspect
t2: m1 modifies the metadata change proposal
t3: m1 writes the metadata change proposal
t4: m2 modifies
t5: m2 writes

In that case wouldn’t m2 still be writing the old aspect that was valid at t1?

@sarpk By Read -> Modify -> Write, I mean that within your application using the SDK to send MCPs you would perform a GET -> some code -> POST. As you pointed out, this does require the application to perform these operations synchronously, as there is no locking in this scenario. With Patch, the whole operation will be done in a single atomic DB transaction.
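
The t0 to t5 interleaving from the question can be replayed deterministically. The sketch below is an illustration only: LostUpdateSketch and its methods are hypothetical names, and the atomic per-key update of ConcurrentHashMap.compute is used as a stand-in for the single DB transaction mentioned above, not as a description of how the server implements PATCH.

```java
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical in-memory model of the tag store; not DataHub code.
final class LostUpdateSketch {
    private final ConcurrentMap<String, Set<String>> tagsByField = new ConcurrentHashMap<>();

    // Returns a mutable copy of the stored tags (empty if none exist).
    Set<String> read(String field) {
        return new TreeSet<>(tagsByField.getOrDefault(field, Set.of()));
    }

    // Full replacement, as with an UPSERT.
    void upsert(String field, Set<String> tags) {
        tagsByField.put(field, new TreeSet<>(tags));
    }

    // Patch-style add: the read and the write happen in one atomic step per key.
    void patchAddTag(String field, String tag) {
        tagsByField.compute(field, (key, existing) -> {
            Set<String> merged = existing == null ? new TreeSet<>() : new TreeSet<>(existing);
            merged.add(tag);
            return merged;
        });
    }

    // Replays t0..t5 with two read-modify-write clients, m1 and m2.
    static Set<String> replayReadModifyWrite() {
        LostUpdateSketch store = new LostUpdateSketch();
        store.upsert("col", Set.of("existing"));
        Set<String> m1 = store.read("col"); // t0: m1 reads
        Set<String> m2 = store.read("col"); // t1: m2 reads
        m1.add("tag1");                     // t2: m1 modifies
        store.upsert("col", m1);            // t3: m1 writes
        m2.add("tag2");                     // t4: m2 modifies its stale copy
        store.upsert("col", m2);            // t5: m2 writes and overwrites tag1
        return store.read("col");
    }

    // Same two changes expressed as atomic patch-style adds.
    static Set<String> replayPatch() {
        LostUpdateSketch store = new LostUpdateSketch();
        store.upsert("col", Set.of("existing"));
        store.patchAddTag("col", "tag1");
        store.patchAddTag("col", "tag2");
        return store.read("col");
    }

    public static void main(String[] args) {
        System.out.println("read-modify-write: " + replayReadModifyWrite());
        System.out.println("patch-style:       " + replayPatch());
    }
}
```

In this model the read-modify-write replay ends without tag1, because m2 writes back the copy it read at t1, while the patch-style replay keeps both tags regardless of the order in which the two adds arrive.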

@HunterEl We don’t currently have it tracked on the OSS roadmap as the effort was requested by a customer and not the community, but keep an eye out for a PR coming in the next couple weeks or so for the initial work on the OSS side 😄
