[SPARK-51272][CORE] Aborting instead of re-submitting of partially completed indeterminate result stage #50630
Conversation
if (eventQueue.nonEmpty) {
  post(eventQueue.remove(0))
}
// `DAGSchedulerEventProcessLoop` is guaranteed to process events sequentially in the main test thread
This part is modified because I have seen the following in unit-tests.log without it (focus on the thread names: `pool-1-thread-1-ScalaTest-running-DAGSchedulerSuite` and `dag-scheduler-message`):
25/04/17 14:15:05.815 pool-1-thread-1-ScalaTest-running-DAGSchedulerSuite INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:1662
25/04/17 14:15:05.815 pool-1-thread-1-ScalaTest-running-DAGSchedulerSuite INFO DAGSchedulerSuite$MyDAGScheduler: Submitting 2 missing tasks from ResultStage 2 (DAGSchedulerSuiteRDD 2) (first 15 tasks are for partitions Vector(0, 1))
25/04/17 14:15:05.816 pool-1-thread-1-ScalaTest-running-DAGSchedulerSuite INFO DAGSchedulerSuite$MyDAGScheduler: Marking ResultStage 2 () as failed due to a fetch failure from ShuffleMapStage 1 (RDD at DAGSchedulerSuite.scala:123)
25/04/17 14:15:05.817 pool-1-thread-1-ScalaTest-running-DAGSchedulerSuite INFO DAGSchedulerSuite$MyDAGScheduler: ResultStage 2 () failed in 3 ms due to ignored
25/04/17 14:15:05.817 pool-1-thread-1-ScalaTest-running-DAGSchedulerSuite INFO DAGSchedulerSuite$MyDAGScheduler: Resubmitting ShuffleMapStage 1 (RDD at DAGSchedulerSuite.scala:123) and ResultStage 2 () due to fetch failure
25/04/17 14:15:05.817 pool-1-thread-1-ScalaTest-running-DAGSchedulerSuite INFO DAGSchedulerSuite$MyDAGScheduler: Executor lost: hostA-exec (epoch 3)
25/04/17 14:15:05.818 pool-1-thread-1-ScalaTest-running-DAGSchedulerSuite INFO DAGSchedulerSuite$MyDAGScheduler: Shuffle files lost for executor: hostA-exec (epoch 3)
25/04/17 14:15:06.023 dag-scheduler-message INFO DAGSchedulerSuite$MyDAGScheduler: Resubmitting failed stages
25/04/17 14:15:06.024 dag-scheduler-message INFO DAGSchedulerSuite$MyDAGScheduler: Submitting ShuffleMapStage 1 (DAGSchedulerSuiteRDD 0), which has no missing parents
25/04/17 14:15:06.025 dag-scheduler-message INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 2.9 KiB, free 2.4 GiB)
25/04/17 14:15:06.025 dag-scheduler-message INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 1825.0 B, free 2.4 GiB)
25/04/17 14:15:06.026 dag-scheduler-message INFO SparkContext: Created broadcast 3 from broadcast at DAGScheduler.scala:1662
25/04/17 14:15:06.027 dag-scheduler-message INFO DAGSchedulerSuite$MyDAGScheduler: Submitting 1 missing tasks from ShuffleMapStage 1 (DAGSchedulerSuiteRDD 0) (first 15 tasks are for partitions Vector(0))
25/04/17 14:15:06.028 dag-scheduler-message INFO DAGSchedulerSuite$MyDAGScheduler: Submitting ShuffleMapStage 0 (DAGSchedulerSuiteRDD 1), which has no missing parents
25/04/17 14:15:06.029 dag-scheduler-message INFO MemoryStore: Block broadcast_4 stored as values in memory (estimated size 2.9 KiB, free 2.4 GiB)
25/04/17 14:15:06.029 dag-scheduler-message INFO MemoryStore: Block broadcast_4_piece0 stored as bytes in memory (estimated size 1826.0 B, free 2.4 GiB)
25/04/17 14:15:06.030 dag-scheduler-message INFO SparkContext: Created broadcast 4 from broadcast at DAGScheduler.scala:1662
25/04/17 14:15:06.030 dag-scheduler-message INFO DAGSchedulerSuite$MyDAGScheduler: Submitting 2 missing tasks from ShuffleMapStage 0 (DAGSchedulerSuiteRDD 1) (first 15 tasks are for partitions Vector(0, 1))
25/04/17 14:15:06.227 pool-1-thread-1-ScalaTest-running-DAGSchedulerSuite INFO DAGSchedulerSuite$MyDAGScheduler: ShuffleMapStage 0 (RDD at DAGSchedulerSuite.scala:123) finished in 198 ms
25/04/17 14:15:06.228 pool-1-thread-1-ScalaTest-running-DAGSchedulerSuite INFO DAGSchedulerSuite$MyDAGScheduler: looking for newly runnable stages
25/04/17 14:15:06.228 pool-1-thread-1-ScalaTest-running-DAGSchedulerSuite INFO DAGSchedulerSuite$MyDAGScheduler: running: HashSet(ShuffleMapStage 1)
25/04/17 14:15:06.229 pool-1-thread-1-ScalaTest-running-DAGSchedulerSuite INFO DAGSchedulerSuite$MyDAGScheduler: waiting: HashSet(ResultStage 2)
25/04/17 14:15:06.229 pool-1-thread-1-ScalaTest-running-DAGSchedulerSuite INFO DAGSchedulerSuite$MyDAGScheduler: failed: HashSet()
25/04/17 14:15:06.233 pool-1-thread-1-ScalaTest-running-DAGSchedulerSuite INFO DAGSchedulerSuite$MyDAGScheduler: ShuffleMapStage 1 (RDD at DAGSchedulerSuite.scala:123) finished in 209 ms
cc @mridulm
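For context, here is a minimal, self-contained sketch (generic names; this is not the actual `DAGSchedulerSuite` tester) of an event loop that drains its queue in the posting thread, which is the behavior the modified test helper relies on so that events are handled on the test thread rather than on `dag-scheduler-message`:

```scala
import scala.collection.mutable.ArrayBuffer

// Sketch of a synchronous event loop for tests. NOT the DAGSchedulerSuite
// implementation; class and method names are made up for illustration.
class SyncEventLoopSketch[E](handle: E => Unit) {
  private val eventQueue = ArrayBuffer.empty[E]
  private var processing = false

  def post(event: E): Unit = {
    eventQueue += event
    if (!processing) {      // avoid re-entrant draining when handle() posts new events
      processing = true
      try {
        while (eventQueue.nonEmpty) {
          handle(eventQueue.remove(0))   // events run in the caller's (test) thread
        }
      } finally {
        processing = false
      }
    }
  }
}
```

With this shape, a handler that posts follow-up events (such as a stage resubmission) still gets them processed before `post` returns to the test, so no second thread appears in the log.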
The "java.lang.OutOfMemoryError: Java heap space" in the pyspark-pandas-connect-part2 is unrelated. |
After the test was restarted the error is resolved. |
Only unsuccessful (and so uncommitted) tasks are candidates for (re)execution (and so commit) - not completed tasks.
As discussed here, this is a bug in the JDBC implementation - the txn commit should be done in a task commit, not as part of writing out the partition.
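To make that distinction concrete, a small hedged sketch (plain JDBC; the helper and table names are made up, and this is not Spark's JDBC data source code) of committing the transaction during the partition write versus deferring it to a task-commit step:

```scala
import java.sql.Connection

// Illustrative only: the record type and table are invented for this sketch.
def insertRow(conn: Connection, record: String): Unit = {
  val stmt = conn.prepareStatement("INSERT INTO t VALUES (?)")
  try { stmt.setString(1, record); stmt.executeUpdate() } finally { stmt.close() }
}

// Problematic: the transaction is committed while the task is still executing,
// so a task that is later re-executed has already published its rows once.
def writePartitionEagerCommit(conn: Connection, rows: Iterator[String]): Unit = {
  rows.foreach(insertRow(conn, _))
  conn.commit()                       // commit as part of writing the partition
}

// What the comment advocates: only stage the rows during execution and commit
// the transaction in the step that runs when the task attempt is committed.
def writePartitionStaged(conn: Connection, rows: Iterator[String]): Unit =
  rows.foreach(insertRow(conn, _))

def onTaskCommit(conn: Connection): Unit = conn.commit()
def onTaskAbort(conn: Connection): Unit = conn.rollback()
```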
The fix for this is to handle something similar to this. I have sketched a rough impl here for reference (it is just illustrative! and meant to convey what I was talking about).
Option 1 is much more aggressive with cleanup, but might spuriously kill jobs a lot more often than required. (I have adapted the tests you included in this PR for both - and they both pass.)
But that's also bad for an indeterminate stage, as the data would be inconsistent: the committed partitions come from a previous, old computation and not from the latest one, while the resubmitted ones come from the new one. To illustrate it: if we write the output of such a stage, part of it reflects the old computation and part of it the new one (see the sketch below).
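Here is a minimal, made-up sketch of that inconsistency (not from the PR): an indeterminate computation draws a value at execution time, so partitions committed by an old attempt cannot match partitions recomputed by a new attempt:

```scala
import scala.util.Random

// Toy model of an indeterminate computation: every (re)execution draws a new salt,
// so recomputing a partition produces different rows than the original attempt.
def computePartition(partitionId: Int, attemptSalt: Long): Seq[Long] =
  (1 to 3).map(i => i * 1000L + partitionId * 10L + attemptSalt)

val firstAttemptSalt  = Random.nextLong() % 10   // salt of the original execution
val secondAttemptSalt = Random.nextLong() % 10   // salt after the resubmit

// Partition 0 was already committed from the first attempt; partition 1 is recomputed
// after the resubmit. The final output mixes rows produced with two different salts.
val committed   = computePartition(0, firstAttemptSalt)
val recomputed  = computePartition(1, secondAttemptSalt)
val mixedOutput = committed ++ recomputed   // not a single consistent snapshot
```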
If the parent map stage was indeterminate, the existing Spark code would have already aborted the stage if there was a fetch failure for that parent stage. As you have pointed out in the test in this PR, there is a gap in the existing impl: when there is a shuffle loss due to executor/host failure (and not detected through a fetch failure), the check for determinism was not being performed before recomputing the lost data; so shuffle files can be lost for an indeterminate stage without ever resulting in a FetchFailure (as in the test). But that does not require failing the result stage, even if it is indeterminate, if no indeterminate parent has lost any shuffle outputs.
This will not happen - please see above. "some but not all tasks was successful and a resubmit happened" -> if it results in re-execution of the parent (indeterminate) stage through a fetch failure, the job will be aborted. Please do let me know if I am missing some nuance. (Edited to hopefully improve clarity!)
@mridulm regarding option 2: why is a return enough here (and not aborting the stage), while when there is an exception at task creation the stage is aborted? And why do we need to check whether all jobs should be aborted, and not just one, here:
It should result in the same behavior (all jobs this stage was part of have been aborted in that scenario - and we have not added the stage to runningStages yet).
A stage can be part of multiple concurrent jobs, and not all of them might be getting aborted: some of them might not have started a result stage yet, and so are recoverable.
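A tiny self-contained sketch of that per-job decision (made-up names, not DAGScheduler code): only jobs whose result stage has already started need to be aborted; the rest can recover by recomputation:

```scala
// Toy model: a stage shared by several jobs; abort only the jobs whose result stage
// has already started, and let the others recover by recomputing the lost data.
case class Job(id: Int, resultStageStarted: Boolean)

def splitJobs(jobsUsingStage: Seq[Job]): (Seq[Job], Seq[Job]) =
  jobsUsingStage.partition(_.resultStageStarted)

val (toAbort, recoverable) =
  splitJobs(Seq(Job(1, resultStageStarted = true), Job(2, resultStageStarted = false)))
// toAbort     == List(Job(1, true))  : its result stage already ran, cannot be rolled back
// recoverable == List(Job(2, false)) : no result stage started yet, safe to recompute
```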
What changes were proposed in this pull request?
This PR aborts a partially completed indeterminate result stage instead of resubmitting it.
Why are the changes needed?
A result stage, compared to a shuffle map stage, has more output and more intermediate state: for example with `FileOutputCommitter` each task does a Hadoop task commit. In case of a re-submit this would lead to re-committing that Hadoop task (possibly with different content).

As long as rollback of a result stage is not supported (https://issues.apache.org/jira/browse/SPARK-25342), the best we can do is abort the stage.
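As a rough, self-contained sketch of the resulting policy (illustrative only; all names are made up and this is not the DAGScheduler code), the decision at resubmission time can be thought of like this:

```scala
// Toy model of the decision this PR introduces when an indeterminate result stage would
// be resubmitted: if some (but not all) of its partitions were already committed by an
// earlier attempt, abort instead of resubmitting, because the committed output cannot be
// rolled back and would be mixed with freshly recomputed, different data.
case class ResultStage(id: Int, indeterminate: Boolean, numPartitions: Int, committed: Set[Int])

sealed trait Decision
case object Resubmit extends Decision
final case class Abort(reason: String) extends Decision

def onResubmit(stage: ResultStage): Decision =
  if (stage.indeterminate && stage.committed.nonEmpty && stage.committed.size < stage.numPartitions)
    Abort(s"stage ${stage.id} is indeterminate and partially completed; rollback is unsupported")
  else
    Resubmit

// Example: partition 0 of 2 already committed -> abort; nothing committed yet -> resubmit.
val partiallyDone = ResultStage(2, indeterminate = true, numPartitions = 2, committed = Set(0))
val untouched     = ResultStage(3, indeterminate = true, numPartitions = 2, committed = Set.empty)
assert(onResubmit(partiallyDone).isInstanceOf[Abort])
assert(onResubmit(untouched) == Resubmit)
```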
The existing code before this PR already tried to address a similar situation at the handling of `FetchFailed` when the fetch is coming from an indeterminate shuffle map stage: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L2178-L2182

But this is not enough, as a `FetchFailed` from a determinate stage can lead to an executor loss and a re-compute of the indeterminate parent of the result stage, as shown in the attached unit test. Moreover, the `FetchFailed` can be in a race with a successful `CompletionEvent`. This is why this PR detects the partial execution at the re-submit of the indeterminate result stage.

Does this PR introduce any user-facing change?
No.
How was this patch tested?
New unit tests are created to illustrate the situation above.
Was this patch authored or co-authored using generative AI tooling?
No.