Overall task input size

TL;DR: A new function to return the total size of all inputs to a task, e.g.:

task foo {
  File input1
  StructType input2
  Array[File] input3

  Float size_sum = total_input_size("GB")

  command { ... }
  runtime { ... }
  output { ... }
}

Detail:

It’s a fairly common pattern to base the disk/memory/cpu for a task on the size of its input files. Where this gets tricky is that it’s (a) tedious and (b) error-prone to include every input file individually in this calculation. Especially after a refactor, it’s easy to forget to include one of the files in the sum, or to miss one because it’s nested in an object/struct.
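For comparison, here is a rough sketch of the manual pattern described above, written against WDL 1.0. The struct `SampleInfo`, its `reads` member, the 10 GB padding, the Docker image, and the Cromwell-style `disks` runtime attribute are all illustrative assumptions, not part of the proposal. It also assumes an engine whose size() accepts an Array[File].

```wdl
version 1.0

# Hypothetical struct, used only to show a File nested inside an input.
struct SampleInfo {
  File reads
  String name
}

task foo {
  input {
    File input1
    SampleInfo input2
    Array[File] input3
  }

  # Every file has to be listed by hand, including the one nested in the
  # struct. Adding a new File input later means remembering to extend this sum.
  Float size_sum = size(input1, "GB") + size(input2.reads, "GB") + size(input3, "GB")

  # Round up and add some headroom (the 10 GB padding is an arbitrary choice).
  Int disk_gb = ceil(size_sum) + 10

  command <<<
    echo "inputs total ~{size_sum} GB"
  >>>

  runtime {
    docker: "ubuntu:20.04"
    # "disks" is an engine-specific attribute (Cromwell style), assumed here.
    disks: "local-disk " + disk_gb + " HDD"
  }

  output {
    String message = read_string(stdout())
  }
}
```

With the proposed function, the hand-written sum would collapse to a single call such as total_input_size("GB"), and a newly added File input would be picked up without touching the calculation.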

I’d like to propose a function total_input_size(), only available within task definitions, which would return the total size of every file needing to be localized to the execution environment for the task.

Like the current size() function, I’d include the optional unit parameter to let people specify the result in MB, GiB, etc.

Before I write up a SPEC change proposal, does this sound like a good idea, or do people have concerns?

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 10 (9 by maintainers)

Top GitHub Comments

1 reaction
patmagee commented, Apr 1, 2021

@vdauwera yes, that is exactly the case. I think we can actually close this in favor of #169

0 reactions
vdauwera commented, Apr 1, 2021

Would this basically be syntactic sugar to shortcut using size(Array[input Files]) as was added by #169?
