Overall task input size

TL;DR: A new function that returns the total size of all inputs to a task, e.g.:

task foo {
  File input1
  StructType input2
  Array[File] input3

  Float size_sum = total_input_size("GB")

  command { ... }
  runtime { ... }
  output { ... }
}

Detail:

It’s a fairly common pattern to base the disk/memory/cpu for a task on the size of its input files. Where this gets tricky is that it’s (a) tedious and (b) error-prone to have to include every input file individually in the calculation. Especially after a refactor, it’s easy to forget to include one of the files in the sum, or to miss one because it’s nested in an object/struct.
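For illustration, the manual pattern described above might look like the following sketch. It assumes WDL 1.0 syntax, where `size()` accepts an `Array[File]` as well as a single `File`; the 10 GB headroom and the Cromwell-style `disks` runtime attribute are assumptions made for the example, not part of the proposal.

```wdl
version 1.0

task foo {
  input {
    File input1
    File input2
    Array[File] input3
  }

  # Each File input has to be listed by hand; a forgotten term is silently left out.
  Float size_sum = size(input1, "GB") + size(input2, "GB") + size(input3, "GB")

  # Hypothetical headroom of 10 GB on top of the inputs.
  Int disk_gb = ceil(size_sum) + 10

  command <<<
    cat ~{input1} ~{input2} ~{sep=" " input3} | wc -c
  >>>

  runtime {
    # "disks" is a Cromwell-style attribute, used here as an assumption.
    disks: "local-disk " + disk_gb + " HDD"
  }

  output {
    Int total_bytes = read_int(stdout())
  }
}
```

If a fourth `File` input is added later, nothing flags that `size_sum` no longer covers every input, which is the failure mode the proposal is aimed at.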

I’d like to propose a function total_input_size(), only available within task definitions, which would return the total size of every file needing to be localized to the execution environment for the task.

Like the current size() function, I’d include the optional unit parameter to let people specify the result in MB, GiB, etc.

Before I write up a SPEC change proposal, does this sound like a good idea, or do people have concerns?

Issue Analytics

State:
Created 5 years ago
Comments: 10 (9 by maintainers)

Top GitHub Comments

1 reaction

patmagee commented, Apr 1, 2021

@vdauwera yes, that is exactly the case. I think we can actually close this in favor of #169

0 reactions

vdauwera commented, Apr 1, 2021

Would this basically be syntactic sugar to shortcut using size(Array[input Files]) as was added by #169?
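A minimal sketch of that workaround, assuming WDL 1.0 where `size()` accepts an `Array[File]` and `flatten()` is available. The `all_inputs` name is hypothetical, and the array still has to be assembled by hand, so files nested inside a struct would need to be added explicitly.

```wdl
version 1.0

task foo {
  input {
    File input1
    Array[File] input3
  }

  # Gather every File input into one array, then call size() once.
  Array[File] all_inputs = flatten([[input1], input3])
  Float size_sum = size(all_inputs, "GB")

  command <<<
    cat ~{sep=" " all_inputs} | wc -c
  >>>

  output {
    Int total_bytes = read_int(stdout())
  }
}
```

Compared with the proposed `total_input_size("GB")`, this keeps the sum in one place but does not discover inputs automatically.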
