Form recognizer serialization

Is your feature request related to a problem? Please describe.

I would like to be able to retrieve the JSON output from Form Recognizer using the Python Form Recognizer SDK.

For example in sample_convert_to_and_from_dict.py, the result is serialised in data.json by:

Obtaining a form recognizer result (an AnalyzeResult object)
Converting to a dictionary with the to_dict method
Serialising to JSON with json.dump and AzureJSONEncoder encoder

import json
from azure.core.serialization import AzureJSONEncoder

# result obtained from document_analysis_client.begin_analyze_document(*args).result()
analyze_result_dict = result.to_dict()
with open('data.json', 'w') as f:
    json.dump(analyze_result_dict, f, cls=AzureJSONEncoder)

There are quite a few differences between data.json and the result of submitting form_1.jpg to Form Recognizer directly (fr.json), e.g. via Form Recognizer Studio and downloading the result manually. These include:

1. fr.json has a header with the keys "status", "createdDateTime", "lastUpdatedDateTime" and "analyzeResult"; the value of "analyzeResult" more or less corresponds to data.json.
2. Some fields need a null check in data.json, e.g. "angle".
3. data.json uses underscore case whilst fr.json uses lower camel case, e.g. "bounding_regions" vs "boundingRegions".
4. Many keys have different names. For instance, for storing fields, data.json uses {"value_type":"string", "value":"some words"}, whilst fr.json uses {"type":"string", "valueString":"some words"}.
5. Some of the values are encoded differently, for example "spans". In data.json, "spans": [ { "offset": 0, "length": 24858 } ]; in fr.json, "spans": { "offset": 0, "length": 24858 }, i.e. data.json has an array containing the object as opposed to just the object.
6. Bounding boxes are encoded differently: in data.json this is an array of four (x, y) pairs (encoded as objects with keys "x", "y" and float values), whilst in fr.json it is an array of 8 floats, i.e.

In data.json: "polygon": [ { "x": 3.184, "y": 0.6537 }, { "x": 5.1022, "y": 0.6537 }, { "x": 5.1022, "y": 0.8722 }, { "x": 3.184, "y": 0.8722 } ]

In fr.json: "polygon": [ 3.184, 0.6537, 5.1022, 0.6537, 5.1022, 0.8722, 3.184, 0.8722 ]
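Differences 3 and 6 can also be worked around on the user side after calling to_dict. The following is only a minimal sketch under that assumption: to_camel, flatten_polygon and to_rest_shape are hypothetical helper names, not part of the SDK, and the code handles only the key casing and polygon layout described above.

```python
def to_camel(key: str) -> str:
    """Convert an underscore-case key to lower camel case."""
    head, *rest = key.split("_")
    return head + "".join(part.capitalize() for part in rest)


def flatten_polygon(points):
    """Turn [{"x": .., "y": ..}, ...] into [x0, y0, x1, y1, ...]."""
    return [coord for p in points for coord in (p["x"], p["y"])]


def to_rest_shape(node):
    """Recursively camel-case keys and flatten polygons (hypothetical helper)."""
    if isinstance(node, dict):
        out = {}
        for key, value in node.items():
            if key == "polygon" and value is not None:
                out[key] = flatten_polygon(value)
            else:
                out[to_camel(key)] = to_rest_shape(value)
        return out
    if isinstance(node, list):
        return [to_rest_shape(item) for item in node]
    return node
```

Note that this camel-cases every key it meets, so custom model field names containing underscores would be rewritten as well; a real implementation would need to exclude those.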

Describe the solution you’d like

Changes to the to_dict (and corresponding from_dict) methods in _models.py to correct problems 2, 3, 4 and 6:

2. Change some lines to have null checks, e.g. "angle": self.angle if self.angle else 0
3. In the to_dict methods, use camel case for keys.
4. Use the same key names as fr.json. In our example, "value_type": self.value_type, "value": value would become "type": self.value_type, f"value{self.value_type.capitalize()}": value
6. Unpack the point objects, e.g. change lines in the to_dict methods such as "bounding_box": [f.to_dict() for f in self.bounding_box] to "boundingBox": [v for f in self.bounding_box for _, v in f.to_dict().items()]

Problem 1 is probably workable on the user end, and I have no idea about 5.
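The changes proposed for problems 2 and 4 can be sketched outside the SDK as plain functions. This is an illustration only: field_to_rest_shape and angle_or_default are hypothetical names, and the "value" + type naming is confirmed by the issue only for the "string" example. Types whose names differ between the SDK and the service output would probably need an explicit mapping.

```python
def field_to_rest_shape(field: dict) -> dict:
    """Rename SDK-style field keys to the fr.json-style keys from the issue."""
    value_type = field["value_type"]
    # Upper-case only the first letter so any inner capitals are preserved.
    suffix = value_type[:1].upper() + value_type[1:]
    return {"type": value_type, f"value{suffix}": field["value"]}


def angle_or_default(angle, default=0):
    """Replace only a missing angle; a legitimate 0.0 is returned unchanged."""
    return default if angle is None else angle
```

An explicit `is None` check gives the same results as `self.angle if self.angle else 0` for numeric angles, but states the intent more directly.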

Describe alternatives you’ve considered

Make REST API calls from Python: probably defeats the point of using the Python SDK.
Use a different programming language's SDK: I haven’t actually checked whether these SDKs have similar problems.
Add a separate to_json or to_json_string method: maybe a good non-breaking solution? Probably lots of code duplication?
Make a different encoder inheriting from AzureJSONEncoder: possibly a good non-breaking solution, and it prevents duplication of edits between the to_dict and from_dict methods if a corresponding decoder is made?

    Could wrap key name strings in objects with a hash method and a toCamel method, and change to_dict to return something like a default_dict which gets and unwraps the objects. The encoder would then call the toCamel method to get the string back in camel case. Would need to check that this does not stop the use of underscores in field names for custom models. Also pretty hacky and slow.
    Problems handling problem 6: unless JSON encoding supports something like HKT, it might be harder to encode a list of Point objects to a list of floats than to unpack them in the dict?
    Possible fix for problem 5?
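The encoder alternative can be sketched with the standard library alone. The example below subclasses json.JSONEncoder so that it runs without the Azure packages; basing it on AzureJSONEncoder instead is an untested assumption. CamelCaseEncoder, _to_camel and _camelize are hypothetical names. Because default() is only invoked for objects the encoder cannot serialise, it never sees dict keys, so the sketch rewrites the whole tree before encoding.

```python
import json


def _to_camel(key):
    """Convert an underscore-case key to lower camel case."""
    head, *rest = key.split("_")
    return head + "".join(part.capitalize() for part in rest)


def _camelize(node):
    """Recursively camel-case the string keys of nested dicts and lists."""
    if isinstance(node, dict):
        return {
            (_to_camel(k) if isinstance(k, str) else k): _camelize(v)
            for k, v in node.items()
        }
    if isinstance(node, (list, tuple)):
        return [_camelize(item) for item in node]
    return node


class CamelCaseEncoder(json.JSONEncoder):
    """Hypothetical encoder that camel-cases dict keys before encoding."""

    def iterencode(self, o, _one_shot=False):
        # Both json.dump and json.dumps end up here, so one override covers both.
        return super().iterencode(_camelize(o), _one_shot)
```

It would be used like any other encoder, e.g. json.dumps(result.to_dict(), cls=CamelCaseEncoder). As with the dictionary-based approach, underscores in custom model field names would be rewritten too.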

Additional context

Sorry if there is a different way to achieve this and for my markdown.

Issue Analytics

Created 10 months ago
Comments: 6 (3 by maintainers)

Top GitHub Comments

qlbp commented, Nov 18, 2022 (1 reaction)

Yeah that’s a really nice solution for my issue, thank you very much @catalinaperalta!

catalinaperalta commented, Nov 18, 2022 (0 reactions)

I see, thanks for the info. To get the raw response from the SDK you can use a callback in the following manner:

def callback(raw_response, _, headers):
    # process raw response as needed
    print(raw_response.http_response.body())

# f is the document to analyze, opened in binary mode
poller = document_analysis_client.begin_analyze_document(
    "<model Id>", document=f, cls=callback
)
result = poller.result()

Let me know if that helps!
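Building on that suggestion, a callback could return the parsed body rather than printing it. This is a sketch under two assumptions that are not confirmed in the thread: that the body is JSON in the same shape as fr.json, and that poller.result() hands back whatever the callback returns. raw_json_callback is a hypothetical name.

```python
import json


def raw_json_callback(pipeline_response, _deserialized, _headers):
    """Return the service's JSON body as a dict (assumes the body is JSON)."""
    return json.loads(pipeline_response.http_response.body())
```

It would be passed as cls=raw_json_callback in the begin_analyze_document call, in the same way as the callback in the comment above.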
