Unicode characters get transformed into surrogate pairs by graphql.print_ast()

I’ve run into an issue where serializing a GraphQL DocumentNode into a string and then parsing it back into a DocumentNode transforms certain Unicode characters into surrogate pairs, which makes them no longer UTF-8 encodable.

This code snippet demonstrates the problem:

import graphql

value = "\U000a90e5"
print(f"Value before serializing: {value!r}")

encoded = value.encode("utf8")
print(f"UTF-8 encoded: {encoded!r}")

query = graphql.DocumentNode(
    definitions=[
        graphql.OperationDefinitionNode(
            operation=graphql.OperationType.QUERY,
            selection_set=graphql.SelectionSetNode(
                selections=[
                    graphql.FieldNode(
                        name=graphql.NameNode(
                            kind="name",
                            value="hello",
                        ),
                        arguments=[
                            graphql.ArgumentNode(
                                name=graphql.NameNode(value="user"),
                                value=graphql.StringValueNode(value=value)
                            )
                        ]
                    )
                ]
            )
        )
    ]
)

serialized_query = graphql.print_ast(query)

print(f"Serialized query: {serialized_query}")

parsed_query = graphql.parse(serialized_query)

value = parsed_query.definitions[0].selection_set.selections[0].arguments[0].value.value
print(f"Value after serializing: {value!r}")

encoded = value.encode("utf8")
print(f"UTF-8 encoded: {encoded!r}")

The Unicode character \U000a90e5 is UTF-8 encodable. However, after placing this value in a DocumentNode tree and serializing the AST to text, the character appears as the surrogate pair \uda64\udce5. Parsing that text back into a DocumentNode via graphql.parse() and extracting the argument value shows that the value has been modified and is no longer UTF-8 encodable. The last line of the snippet raises:

UnicodeEncodeError: 'utf-8' codec can't encode characters in position 0-1: surrogates not allowed
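For reference, here is a minimal sketch of where the pair \uda64\udce5 comes from and how an affected string could be repaired on the caller's side. It uses only the Python standard library. The helper names to_surrogate_pair and merge_surrogate_pairs are made up for illustration and are not part of graphql-core; this is a possible workaround, not the fix the maintainers applied.

```python
def to_surrogate_pair(char: str) -> str:
    """Split one supplementary-plane character into its UTF-16 surrogate pair."""
    code_point = ord(char)
    if code_point < 0x10000:
        raise ValueError("character is in the Basic Multilingual Plane")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)   # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits of the offset
    return chr(high) + chr(low)


def merge_surrogate_pairs(text: str) -> str:
    """Recombine surrogate pairs in a str into single code points.

    Hypothetical helper: round-trips through UTF-16 with the
    'surrogatepass' error handler, so valid pairs are merged and
    lone surrogates are passed through unchanged.
    """
    raw = text.encode("utf-16-le", "surrogatepass")
    return raw.decode("utf-16-le", "surrogatepass")


original = "\U000a90e5"
broken = to_surrogate_pair(original)
print(ascii(broken))               # shows the two surrogate code units

repaired = merge_surrogate_pairs(broken)
print(repaired == original)        # the pair is merged back into one character
print(repaired.encode("utf8"))     # encodes without raising UnicodeEncodeError
```

Applying merge_surrogate_pairs to the value extracted from the parsed document should make it UTF-8 encodable again, assuming the only damage is the split into surrogates.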

Environment:

  • Python 3.8.5
  • graphql-core 3.1.4

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 8 (4 by maintainers)

Top GitHub Comments

1 reaction
Cito commented, Apr 15, 2021

Regarding your question, there are good reasons (see also “Goals and Restrictions” in the README). First, GraphQL.js was originally made by Facebook and was written by Lee Byron, who is also the co-author of GraphQL. Meanwhile, it is developed as part of the GraphQL Foundation. So this library is probably the closest to the specs, and it is also continually updated. Second, I needed to restrict the scope of the project in order to be able to maintain it for a long time in a sustainable way, since I work on this only in my spare time. It surely is possible to create a GraphQL implementation in Python that is more performant, more Pythonic or has more features, but that’s outside the declared scope of this project. Maybe an idea for another project.

0 reactions
Cito commented, Dec 27, 2021

The PR mentioned here has meanwhile been ported to GraphQL-core and has been available since v3.2.0rc1.
