WriteColumnAsync throwing exception when serializing Int32
Issue:
I have the following function that serializes a DataTable into a memory stream. When it tries to serialize the first column, “ID”, which is of type System.Int32, I get the error “failed to encode data page data for column ID (System.Int32)”. The overall goal is to serialize a generic object: the service reads settings from a database whose table names indicate what it should serialize and store, so I generate a schema in GenerateSchema() and then serialize the data in CreateStreamFromDataTable(), both shown below. Any idea why the line await rgw.WriteColumnAsync(new Parquet.Data.DataColumn(fields[i], valuesArray)); would throw that error when the type is int?
Stack Trace:
   at Parquet.File.DataColumnWriter.WriteColumnAsync(DataColumn column, SchemaElement tse, Int32 maxRepetitionLevel, Int32 maxDefinitionLevel, CancellationToken cancellationToken)
   at Parquet.File.DataColumnWriter.WriteAsync(FieldPath fullPath, DataColumn column, CancellationToken cancellationToken)
   at Parquet.ParquetRowGroupWriter.WriteColumnAsync(DataColumn column, CancellationToken cancellationToken)
   at ParquetArchiverService.CreateStreamFromDataTable(IReadOnlyList`1 fields, DataTable dt, String tableName) in C:\ParquetArchiverService.cs:line 363
Code:
    private List<DataField> GenerateSchema(DataTable dt)
    {
        var fields = new List<DataField>(dt.Columns.Count);
        try
        {
            foreach (DataColumn column in dt.Columns)
            {
                if (column.DataType == typeof(Guid))
                {
                    fields.Add(new DataField(column.ColumnName, typeof(string)));
                }
                else
                {
                    fields.Add(new DataField(column.ColumnName, column.DataType));
                }
            }
            return fields;
        }
        catch (Exception ex)
        {
            _logger.LogError(ex.Message);
            return fields;
        }
    }

    private async Task<Stream> CreateStreamFromDataTable(IReadOnlyList<DataField> fields, DataTable dt, string tableName)
    {
        _logger.LogInformation(CONVERTING_DATA_TO_PARQUET);
        // Open the output stream for writing
        var stream = new MemoryStream();
        using var writer = await ParquetWriter.CreateAsync(new ParquetSchema(fields), stream);
        var startRow = 0;
        _logger.LogInformation($"Converting {tableName} data to Parquet format");
        try
        {
            // Keep on creating row groups until we run out of data
            while (startRow < dt.Rows.Count)
            {
                using (var rgw = writer.CreateRowGroup())
                {
                    // Data is written to the row group column by column
                    for (var i = 0; i < dt.Columns.Count; i++)
                    {
                        var columnIndex = i;
                        // Determine the target data type for the column
                        var targetType = dt.Columns[columnIndex].DataType;
                        if (targetType == typeof(DateTime)) targetType = typeof(DateTimeOffset);
                        if (targetType == typeof(Guid)) targetType = typeof(string);
                        // Generate the value type, this is to ensure it can handle null values
                        var valueType = targetType.IsClass
                            ? targetType
                            : typeof(Nullable<>).MakeGenericType(targetType);
                        // Create a list to hold values of the required type for the column
                        var list = (IList)typeof(List<>)
                            .MakeGenericType(valueType)
                            .GetConstructor(Type.EmptyTypes)
                            ?.Invoke(null)!;
                        // Get the data to be written to the parquet stream
                        foreach (var row in dt.AsEnumerable().Skip(startRow).Take(ROW_GROUP_SIZE))
                        {
                            // Check if value is null, if so then add a null value
                            if (row[columnIndex] == DBNull.Value)
                            {
                                list.Add(null);
                            }
                            else
                            {
                                if (dt.Columns[columnIndex].DataType == typeof(DateTime))
                                {
                                    list.Add(new DateTimeOffset((DateTime)row[columnIndex]));
                                }
                                else if (dt.Columns[columnIndex].DataType == typeof(Guid))
                                {
                                    var success = Guid.TryParse(row[columnIndex].ToString(), out var guid);
                                    if (success)
                                    {
                                        list.Add(guid.ToString("D"));
                                    }
                                    else
                                    {
                                        _logger.LogError(TRY_PARSE_GUID_FAILED);
                                        var agencyGuid = new Guid(Strings.AGENCY_ID_ADMIN);
                                        list.Add(agencyGuid.ToString("D"));
                                    }
                                }
                                else
                                {
                                    list.Add(row[columnIndex]);
                                }
                            }
                        }
                        // Copy the list values to an Array of the same type, as the WriteColumn method expects
                        var valuesArray = Array.CreateInstance(valueType, list.Count);
                        list.CopyTo(valuesArray, 0);
                        // Write the column
                        await rgw.WriteColumnAsync(new Parquet.Data.DataColumn(fields[i], valuesArray));
                    }
                }
                startRow += ROW_GROUP_SIZE;
            }
        }
        catch (Exception ex)
        {
            var errorInfo = $"Error occured while converting {tableName} data to parquet format.";
            _logger.LogError(ex, errorInfo);
        }
        return stream;
    }

Comments:
Ah, I may have found the issue. The DataType enum was giving me System.Int32? for DataField.DataType, but the System.Type was not nullable. When generating the schema I had to use:
fields.Add(new DataField(column.ColumnName, typeof(Nullable<>).MakeGenericType(column.DataType)));
This issue can be closed.
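One caveat for anyone reusing that line: typeof(Nullable<>).MakeGenericType(...) only accepts non-nullable value types, so it throws an ArgumentException if a column's DataType is a reference type such as string. A minimal sketch of a guard, assuming the schema is built as in GenerateSchema() above; SchemaTypes and ToNullableType are hypothetical names for illustration and are not part of Parquet.Net:

```csharp
using System;

public static class SchemaTypes
{
    // Hypothetical helper, not a Parquet.Net API.
    // Wraps non-nullable value types (int, DateTime, ...) in Nullable<T>
    // and returns every other type unchanged.
    public static Type ToNullableType(Type type)
    {
        if (type.IsValueType && Nullable.GetUnderlyingType(type) == null)
        {
            return typeof(Nullable<>).MakeGenericType(type);
        }

        // Reference types (e.g. string) and types that are already
        // Nullable<T> can hold null as they are.
        return type;
    }
}
```

With this in place, the else branch of GenerateSchema() could read fields.Add(new DataField(column.ColumnName, SchemaTypes.ToNullableType(column.DataType)));, so that the field type matches the nullable arrays built in CreateStreamFromDataTable().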
I’m glad you were able to find the issue and fix it. That’s awesome!👏
Thank you for sharing your solution with us. It will be extremely helpful for other users who might encounter the same problem.
I appreciate your contribution to our project. You are doing great work! 😊
I will close this issue as resolved. Please let me know if you have any other questions or feedback.
Have a wonderful day!