WriteColumnAsync throwing exception when serializing Int32

Issue description

Issue:

I have the following function that serializes a DataTable into a memory stream. When it tries to serialize the first column, “ID”, which is of type System.Int32, I get the error “failed to encode data page data for column ID (System.Int32)”.

The overall goal is to serialize a generic object. The service reads settings from a database, and those settings contain the names of the tables the service should serialize and store. I therefore generate a schema in GenerateSchema() and then serialize the data in CreateStreamFromDataTable(); both are in the code below.

Any idea why the line

    await rgw.WriteColumnAsync(new Parquet.Data.DataColumn(fields[i], valuesArray));

would throw that error when the type is int?

Stack Trace:

    at Parquet.File.DataColumnWriter.WriteColumnAsync(DataColumn column, SchemaElement tse, Int32 maxRepetitionLevel, Int32 maxDefinitionLevel, CancellationToken cancellationToken)
    at Parquet.File.DataColumnWriter.WriteAsync(FieldPath fullPath, DataColumn column, CancellationToken cancellationToken)
    at Parquet.ParquetRowGroupWriter.WriteColumnAsync(DataColumn column, CancellationToken cancellationToken)
    at ParquetArchiverService.CreateStreamFromDataTable(IReadOnlyList`1 fields, DataTable dt, String tableName) in C:\ParquetArchiverService.cs:line 363

Code:

    private List<DataField> GenerateSchema(DataTable dt)
    {
        var fields = new List<DataField>(dt.Columns.Count);
        try
        {
            foreach (DataColumn column in dt.Columns)
            {
                if (column.DataType == typeof(Guid))
                {
                    fields.Add(new DataField(column.ColumnName, typeof(string)));
                }
                else
                {
                    fields.Add(new DataField(column.ColumnName, column.DataType));
                }
            }
            return fields;
        }
        catch (Exception ex)
        {
            _logger.LogError(ex.Message);
            return fields;
        }
    }

    private async Task<Stream> CreateStreamFromDataTable(IReadOnlyList<DataField> fields, DataTable dt, string tableName)
    {
        _logger.LogInformation(CONVERTING_DATA_TO_PARQUET);
        // Open the output stream for writing
        var stream = new MemoryStream();
        using var writer = await ParquetWriter.CreateAsync(new ParquetSchema(fields), stream);
        var startRow = 0;

        _logger.LogInformation($"Converting {tableName} data to Parquet format");

        try
        {
            // Keep on creating row groups until we run out of data
            while (startRow < dt.Rows.Count)
            {
                using (var rgw = writer.CreateRowGroup())
                {
                    // Data is written to the row group column by column
                    for (var i = 0; i < dt.Columns.Count; i++)
                    {
                        var columnIndex = i;

                        // Determine the target data type for the column
                        var targetType = dt.Columns[columnIndex].DataType;
                        if (targetType == typeof(DateTime)) targetType = typeof(DateTimeOffset);
                        if (targetType == typeof(Guid)) targetType = typeof(string);

                        // Generate the value type, this is to ensure it can handle null values
                        var valueType = targetType.IsClass
                            ? targetType
                            : typeof(Nullable<>).MakeGenericType(targetType);

                        // Create a list to hold values of the required type for the column
                        var list = (IList)typeof(List<>)
                            .MakeGenericType(valueType)
                            .GetConstructor(Type.EmptyTypes)
                            ?.Invoke(null)!;

                        // Get the data to be written to the parquet stream
                        foreach (var row in dt.AsEnumerable().Skip(startRow).Take(ROW_GROUP_SIZE))
                        {
                            // Check if value is null, if so then add a null value
                            if (row[columnIndex] == DBNull.Value)
                            {
                                list.Add(null);
                            }
                            else
                            {
                                if (dt.Columns[columnIndex].DataType == typeof(DateTime))
                                {
                                    list.Add(new DateTimeOffset((DateTime)row[columnIndex]));
                                }
                                else if (dt.Columns[columnIndex].DataType == typeof(Guid))
                                {
                                    var success = Guid.TryParse(row[columnIndex].ToString(), out var guid);
                                    if (success)
                                    {
                                        list.Add(guid.ToString("D"));
                                    }
                                    else
                                    {
                                        _logger.LogError(TRY_PARSE_GUID_FAILED);
                                        var agencyGuid = new Guid(Strings.AGENCY_ID_ADMIN);
                                        list.Add(agencyGuid.ToString("D"));
                                    }
                                }
                                else
                                {
                                    list.Add(row[columnIndex]);
                                }
                            }
                        }

                        // Copy the list values to an Array of the element type
                        // that the WriteColumn method expects
                        var valuesArray = Array.CreateInstance(valueType, list.Count);
                        list.CopyTo(valuesArray, 0);

                        // Write the column
                        await rgw.WriteColumnAsync(new Parquet.Data.DataColumn(fields[i], valuesArray));
                    }
                }
                startRow += ROW_GROUP_SIZE;
            }
        }
        catch (Exception ex)
        {
            var errorInfo = $"Error occured while converting {tableName} data to parquet format.";
            _logger.LogError(ex, errorInfo);
        }

        return stream;
    }

Issue Analytics

  • State: closed
  • Created 7 months ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

1 reaction
ghost commented, Mar 16, 2023

Ah, I may have found the issue. The DataType enum was giving me System.Int32? for the DataField.DataType, but the System.Type was not nullable. When generating the schema I had to use

    fields.Add(new DataField(column.ColumnName, typeof(Nullable<>).MakeGenericType(column.DataType)));

This issue can be closed.
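
For readers who hit the same error: the write loop in the question already builds Nullable<T> arrays for value-type columns (an int column becomes an int?[]), while GenerateSchema() declared the field with the plain column type, so the two disagreed. Note also that typeof(Nullable<>).MakeGenericType(...) only accepts value types, so applying the one-line fix above to a string column would throw an ArgumentException. The following is a minimal sketch of a type-mapping helper that covers both cases. It uses only the .NET base class library; the class and method names are hypothetical and are not part of Parquet.Net.

```csharp
using System;

// Hypothetical helper, not part of Parquet.Net: maps a DataTable column type
// to the CLR type that the write loop in the question ends up producing.
public static class SchemaTypeSketch
{
    public static Type ToSchemaClrType(Type columnType)
    {
        // The question stores Guid columns as strings.
        if (columnType == typeof(Guid))
            return typeof(string);

        // Reference types (string, byte[], ...) can already hold null,
        // and a type that is already Nullable<T> needs no further wrapping.
        if (!columnType.IsValueType || Nullable.GetUnderlyingType(columnType) != null)
            return columnType;

        // Remaining value types (int, long, bool, ...) become Nullable<T>,
        // matching the element type of the array the write loop builds.
        return typeof(Nullable<>).MakeGenericType(columnType);
    }
}
```

Assuming the DataField constructor used in the question, the schema line would then read fields.Add(new DataField(column.ColumnName, SchemaTypeSketch.ToSchemaClrType(column.DataType)));, which also makes the separate Guid branch in GenerateSchema() unnecessary.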

0 reactions
aloneguid commented, Mar 17, 2023

I’m glad you were able to find the issue and fix it. That’s awesome!👏

Thank you for sharing your solution with us. It will be extremely helpful for other users who might encounter the same problem.

I appreciate your contribution to our project. You are doing great work! 😊

I will close this issue as resolved. Please let me know if you have any other questions or feedback.

Have a wonderful day!
