Form recognizer serialization
Is your feature request related to a problem? Please describe.
I would like to be able to retrieve the JSON output from Form Recognizer using the Python Form Recognizer SDK.
For example, in sample_convert_to_and_from_dict.py, the result is serialised to data.json by:
- Obtaining a form recognizer result (an AnalyzeResult object)
- Converting to a dictionary with the to_dict method
- Serialising to JSON with json.dump and the AzureJSONEncoder encoder
import json
from azure.core.serialization import AzureJSONEncoder

# result obtained from document_analysis_client.begin_analyze_document(*args).result()
analyze_result_dict = result.to_dict()
with open('data.json', 'w') as f:
    json.dump(analyze_result_dict, f, cls=AzureJSONEncoder)
There are quite a few differences between data.json and the result of submitting form_1.jpg to Form Recognizer directly (fr.json), e.g. via Form Recognizer Studio and downloading the result manually. These include:
1. fr.json has a header including the keys "status", "createdDateTime", "lastUpdatedDateTime" and "analyzeResult", with the value of "analyzeResult" more or less corresponding to data.json.
2. Some fields need a null check in data.json, e.g. "angle".
3. data.json uses underscore case whilst fr.json uses lower camel case, e.g. "bounding_regions" vs "boundingRegions".
4. Many keys have different names. For instance, for storing fields, data.json uses {"value_type":"string", "value":"some words"}, whilst fr.json uses {"type":"string", "valueString":"some words"}.
5. Some of the values are encoded differently, for example "spans".
   In data.json: "spans": [ { "offset": 0, "length": 24858 } ]
   In fr.json: "spans": { "offset": 0, "length": 24858 }
   i.e. data.json has an array containing the object, as opposed to just the object.
6. Bounding boxes are encoded differently. In data.json this is an array of four (x, y) pairs (encoded as objects with keys "x" and "y" and float values), whilst in fr.json it is an array of 8 floats, i.e.
   In data.json: "polygon": [ { "x": 3.184, "y": 0.6537 }, { "x": 5.1022, "y": 0.6537 }, { "x": 5.1022, "y": 0.8722 }, { "x": 3.184, "y": 0.8722 } ]
   In fr.json: "polygon": [ 3.184, 0.6537, 5.1022, 0.6537, 5.1022, 0.8722, 3.184, 0.8722 ]
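To make the polygon difference concrete, here is a minimal sketch of converting between the two encodings on the user end. It works on plain dictionaries and lists only; `flatten_polygon` and `unflatten_polygon` are hypothetical helper names, not part of the SDK.

```python
from typing import Dict, List


def flatten_polygon(points: List[Dict[str, float]]) -> List[float]:
    """Turn [{"x": .., "y": ..}, ...] into [x0, y0, x1, y1, ...]."""
    flat: List[float] = []
    for point in points:
        # Read the keys explicitly so the result does not depend on dict order.
        flat.extend((point["x"], point["y"]))
    return flat


def unflatten_polygon(values: List[float]) -> List[Dict[str, float]]:
    """Inverse operation: pair consecutive floats into {"x", "y"} objects."""
    if len(values) % 2 != 0:
        raise ValueError("expected an even number of coordinates")
    return [{"x": x, "y": y} for x, y in zip(values[0::2], values[1::2])]


# The polygon from data.json quoted in the issue.
sdk_polygon = [
    {"x": 3.184, "y": 0.6537},
    {"x": 5.1022, "y": 0.6537},
    {"x": 5.1022, "y": 0.8722},
    {"x": 3.184, "y": 0.8722},
]
service_polygon = flatten_polygon(sdk_polygon)
```

Reading `"x"` and `"y"` by name avoids relying on the insertion order of the dictionary, which an approach based on iterating over `.items()` would depend on.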
Describe the solution you’d like
Changes to the to_dict (and corresponding from_dict) methods in _models.py to correct problems 2, 3, 4 and 6:
2. Change some lines to have null checks, e.g. "angle": self.angle if self.angle else 0
3. In to_dict methods, use camel case for keys.
4. Change these to use different names. For instance, in our example, we would change
   "value_type": self.value_type, "value": value
   to
   "type": self.value_type, f"value{self.value_type.capitalize()}": value
6. Unpack the point objects, e.g. change lines in to_dict methods such as
   "bounding_box": [f.to_dict() for f in self.bounding_box]
   to
   "boundingBox": [v for f in self.bounding_box for _, v in f.to_dict().items()]
Problem 1 is probably workable on the user end, and I have no idea about problem 5.
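As a rough illustration of proposals 2 and 4, the sketch below builds the renamed field dictionary and applies the null check outside of any SDK class. The function names are hypothetical, and "phoneNumber" is used only as an assumed example of a type name that is itself camel case. One detail worth noting: `str.capitalize()` lower-cases everything after the first character, so the sketch upper-cases only the first character instead.

```python
from typing import Any, Dict, Optional


def rest_value_key(value_type: str) -> str:
    """Build a key such as "valueString" from a type name such as "string".

    Only the first character is upper-cased. str.capitalize() would turn a
    camel-case name like "phoneNumber" into "Phonenumber".
    """
    return "value" + value_type[:1].upper() + value_type[1:]


def field_to_rest_dict(value_type: str, value: Any) -> Dict[str, Any]:
    """Hypothetical replacement for {"value_type": ..., "value": ...}."""
    return {"type": value_type, rest_value_key(value_type): value}


def angle_or_zero(angle: Optional[float]) -> float:
    """Null check for "angle": a missing value is reported as 0."""
    return 0 if angle is None else angle


field = field_to_rest_dict("string", "some words")
```

Whether every SDK type name maps onto the service's key name this mechanically is not established in this issue, so a real change would need to check each type.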
Describe alternatives you’ve considered
- Make REST API calls from Python: probably defeats the point of using the Python SDK.
- Use a different programming language's SDK: I haven't actually checked whether those SDKs have similar problems.
- Add a separate to_json or to_json_string method: maybe a good non-breaking solution? Probably lots of code duplication?
- Make a different encoder inheriting from AzureJSONEncoder: possibly a good non-breaking solution, and it prevents duplication of edits between the to_dict and from_dict methods if a corresponding decoder is made?
  - Could wrap key name strings in objects with a hash method and a toCamel method, and change to_dict to return something like a defaultdict which gets and unwraps the objects. The encoder would then call the toCamel method to get the string back in camel case. Would need to check this would not stop the use of underscores in field names for custom models? Also pretty hacky and not very performant.
  - Handling problem 6 is harder: unless JSON encoding supports something like HKT, it might be harder to encode a list of Point objects as a list of floats than to unpack them in the dict?
  - Possible fix for problem 5?
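The encoder alternative could look roughly like the sketch below. It is an assumption-laden illustration, not the SDK's behaviour: it subclasses the standard library's `json.JSONEncoder` so that it runs on its own (the idea above is to inherit from AzureJSONEncoder instead), and `to_camel`, `camelize` and `CamelCaseEncoder` are hypothetical names. It rewrites keys before encoding by overriding `iterencode`, which both `json.dump` and `json.dumps` go through.

```python
import json
import re
from typing import Any

_UNDERSCORE = re.compile(r"_([a-z0-9])")


def to_camel(key: str) -> str:
    """Convert underscore case to lower camel case."""
    return _UNDERSCORE.sub(lambda match: match.group(1).upper(), key)


def camelize(obj: Any) -> Any:
    """Recursively rename string keys in dicts, descending into lists."""
    if isinstance(obj, dict):
        return {
            to_camel(key) if isinstance(key, str) else key: camelize(value)
            for key, value in obj.items()
        }
    if isinstance(obj, (list, tuple)):
        return [camelize(item) for item in obj]
    return obj


class CamelCaseEncoder(json.JSONEncoder):
    """Stand-in base class; swap in AzureJSONEncoder to keep its type handling."""

    def iterencode(self, o: Any, _one_shot: bool = False):
        # Rename the keys first, then let the parent class do the encoding.
        return super().iterencode(camelize(o), _one_shot=_one_shot)


data = {"bounding_regions": [{"page_number": 1}], "value_type": "string"}
encoded = json.dumps(data, cls=CamelCaseEncoder)
```

As written, this renames every key containing an underscore, so it would also alter user-defined field names from custom models, which is the concern raised above. It also does not touch dictionaries produced later by the encoder's `default` method.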
Additional context
Sorry if there is a different way to achieve this, and for my markdown.
Issue Analytics
- State:
- Created 10 months ago
- Comments: 6 (3 by maintainers)
Top GitHub Comments
Yeah that’s a really nice solution for my issue, thank you very much @catalinaperalta!
I see, thanks for the info. To get the raw response from the SDK you can use a callback in the following manner:
Let me know if that helps!