
[BUG]: Applying UDFs to TimestampType causes occasional exception


Describe the bug

When applying multiple UDFs to a dataframe, an error is thrown when at least one of the UDFs' target columns is a TimestampType column.

To Reproduce

Below is a minimal reproduction. A comment in the code explains how to create the necessary parquet file, which must have two timestamp columns and a sufficient number of rows (~2k) for the issue to occur. It is launched via the following command:

spark-submit --class org.apache.spark.deploy.dotnet.DotnetRunner --packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.2 --master local .\microsoft-spark-3-0_2.12-1.0.0.jar dotnet .\spark_cache_bug.dll
using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.Spark.Sql;
using Microsoft.Spark.Sql.Types;

using static Microsoft.Spark.Sql.Functions;

namespace spark_cache_bug
{
    public static class SparkGenerator
    {
        public static Func<Column, Column> Generate(StructType returnType, string column) =>
            Udf<Row>(row => new GenericRow(new string[] { "1970-01-02 00:00:00" }), returnType);
    }

    public static class Program
    {
        private static void Generate(Dictionary<string, Column> expressionMap, string column, DataFrame df)
        {
            var typeMapper = new Dictionary<string, DataType>();
            foreach (var f in df.Schema().Fields)
            {
                typeMapper.Add(f.Name, f.DataType);
            }

            var returnType = new StructType(new[] {new StructField(column, new StringType())});

            var udf = SparkGenerator.Generate(returnType, column);

            var newCol = udf(Struct(expressionMap[column]));
            expressionMap[column] = newCol.GetField(column).Cast(typeMapper[column].SimpleString).Alias(column);

        }

        /*
         * To create the parquet file for testing.
         * spark-shell --packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.2   (assuming you are writing file to S3)
         * import java.sql.Timestamp
         * val df = sc.parallelize( Seq.fill(10000){(new Timestamp(0), new Timestamp(0))}).toDF("timestamp1","timestamp2")
         * df.write.parquet("...")
         *
         * this creates a parquet file with 2 columns called timestamp1 and timestamp2.  They are both of type TimestampType
         */

        private static void Main(string[] args)
        {

            var spark = SparkSession.Builder().GetOrCreate();
            spark.Conf().Set("fs.s3a.access.key", "<AWS ACCESS ID>");
            spark.Conf().Set("fs.s3a.secret.key", "<AWS ACCESS SECRET KEY");
            var sourcePath = "s3a://<PATH TO FILE>";
            var outputPath = "s3a://<DESIRED OUTPUT LOCATION>";

            var df = spark.Read().Parquet(sourcePath);

            var expressionMap = df.Schema().Fields.ToDictionary(f => f.Name, f => df.Col(f.Name));

            Generate(expressionMap, "timestamp1", df);
            Generate(expressionMap, "timestamp2", df);

            df = df.Select(df.Schema().Fields.Select(f => expressionMap[f.Name]).ToArray());
            df.Write().Mode("overwrite").Parquet(outputPath);
        }
    }
}

The exception thrown is:

Caused by: org.apache.spark.api.python.PythonException: System.InvalidCastException: Unable to cast object of type 'Microsoft.Spark.Sql.Types.Timestamp' to type 'System.Int32'.
   at Microsoft.Spark.Sql.Types.TimestampType.FromInternal(Object obj) in /_/src/csharp/Microsoft.Spark/Sql/Types/SimpleTypes.cs:line 116
   at Microsoft.Spark.Sql.Row.Convert() in /_/src/csharp/Microsoft.Spark/Sql/Row.cs:line 152
   at Microsoft.Spark.Sql.Row..ctor(Object[] values, StructType schema) in /_/src/csharp/Microsoft.Spark/Sql/Row.cs:line 36
   at Microsoft.Spark.Sql.RowConstructor.GetRow() in /_/src/csharp/Microsoft.Spark/Sql/RowConstructor.cs:line 104
   at Microsoft.Spark.Sql.RowConstructor.GetRow() in /_/src/csharp/Microsoft.Spark/Sql/RowConstructor.cs:line 100
   at Microsoft.Spark.Worker.Command.PicklingSqlCommandExecutor.ExecuteCore(Stream inputStream, Stream outputStream, SqlCommand[] commands) in D:\a\1\s\src\csharp\Microsoft.Spark.Worker\Command\SqlCommandExecutor.cs:line 146
   at Microsoft.Spark.Worker.Command.SqlCommandExecutor.Execute(Version version, Stream inputStream, Stream outputStream, PythonEvalType evalType, SqlCommand[] commands) in D:\a\1\s\src\csharp\Microsoft.Spark.Worker\Command\SqlCommandExecutor.cs:line 76
   at Microsoft.Spark.Worker.Command.CommandExecutor.Execute(Stream inputStream, Stream outputStream, Int32 splitIndex, CommandPayload commandPayload) in D:\a\1\s\src\csharp\Microsoft.Spark.Worker\Command\CommandExecutor.cs:line 65
   at Microsoft.Spark.Worker.TaskRunner.ProcessStream(Stream inputStream, Stream outputStream, Version version, Boolean& readComplete) in D:\a\1\s\src\csharp\Microsoft.Spark.Worker\TaskRunner.cs:line 154
        at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:503)
        at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$2.read(PythonUDFRunner.scala:81)
        at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$2.read(PythonUDFRunner.scala:64)
        at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:456)
        at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:489)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$executeTask$1(FileFormatWriter.scala:272)
        at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1411)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:281)
        ... 9 more

Our investigation

After investigating a bit, it appears that a given row's TimestampType column is sometimes passed to TimestampType.FromInternal multiple times (once per UDF call). The first time FromInternal is called, it correctly creates a new Timestamp object. The second time, however, that same Timestamp object is passed back into the function, which expects an int or long. The culprit might be RowConstructor.GetRow, which appears to reuse _args. When I modify GetRow to first make a copy of _args and pass the copy to the Row constructor, the issue goes away.
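For illustration, here is a rough sketch of that workaround. This is a guess at the shape of the fix, not the library's actual source: the field names _args and _schema and the Row constructor signature are inferred from the report and the stack trace above, and may differ in the real RowConstructor.

// Hypothetical sketch of the workaround (assumed field names _args and
// _schema; the real Microsoft.Spark RowConstructor may differ).
// Row's constructor runs Convert(), which applies FromInternal to each
// value, e.g. turning a raw microsecond count into a Timestamp object.
// Handing the constructor a copy keeps that in-place conversion from
// leaking into the shared buffer, so a second GetRow() call for the
// same row still sees the raw, unconverted values.
internal Row GetRow()
{
    var argsCopy = new object[_args.Length];
    Array.Copy(_args, argsCopy, _args.Length);
    return new Row(argsCopy, _schema);
}

With the shared buffer left untouched, each UDF invocation converts from the raw internal value, and the InvalidCastException in TimestampType.FromInternal no longer has a chance to occur.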

Desktop:

  • Windows 10
  • Chrome
  • v1.0.0 running Spark 3

Additional context

This error does not always occur, but it happens with high probability when the dataframe has at least ~2k rows.


Top GitHub Comments

akamor commented, Oct 29, 2020

Sure. Can you reach out to hello@tonic.ai? I'll see the e-mail and respond.

akamor commented, Oct 29, 2020

@elvaliuliuliu Sorry, just saw you already asked for the AWS setup. I’ll get that to you in a few minutes!
