datadeps: Implement an optimizing scheduler #592

Draft: wants to merge 9 commits into base: master

jpsamaroo
Member

At its core, this PR implements a numerical-optimizer-based scheduler for Datadeps. It uses JuMP to implement the scheduler designed by @pszufe and documented at https://github.com/pszufe/DagScheduler. The idea is to aggressively optimize a Datadeps DAG ahead of time, using all available information. Because it optimizes over the entire DAG, this scheduler can make near-optimal scheduling decisions; our existing JIT-style schedulers, by contrast, only look at the few tasks currently in front of them.
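
To illustrate the difference between whole-DAG and JIT-style scheduling, here is a minimal sketch in Python (chosen only for brevity; it is not code from this PR and does not use JuMP). It swaps the numerical optimizer for brute-force enumeration over a three-task DAG. All task names, costs, and the flat transfer penalty `XFER` are hypothetical.

```python
from itertools import product

PROCS = ["cpu", "gpu"]
XFER = 2  # hypothetical flat cost of moving a result between processors
# Hypothetical per-task execution cost on each processor.
COST = {
    "a": {"cpu": 2, "gpu": 3},
    "b": {"cpu": 10, "gpu": 2},
    "c": {"cpu": 6, "gpu": 1},
}
DEPS = {"a": [], "b": ["a"], "c": ["a"]}
ORDER = ["a", "b", "c"]  # a fixed topological order of the DAG


def finish_time(task, proc, assign, finish, free):
    """Finish time of `task` on `proc`, given earlier placements."""
    ready = max(
        (finish[d] + (XFER if assign[d] != proc else 0) for d in DEPS[task]),
        default=0,
    )
    return max(ready, free[proc]) + COST[task][proc]


def makespan(assign):
    """Simulate a complete assignment and return its total run time."""
    finish, free = {}, dict.fromkeys(PROCS, 0)
    for t in ORDER:
        p = assign[t]
        finish[t] = free[p] = finish_time(t, p, assign, finish, free)
    return max(finish.values())


def greedy_schedule():
    """JIT-style: place each task where it finishes soonest right now."""
    assign, finish, free = {}, {}, dict.fromkeys(PROCS, 0)
    for t in ORDER:
        p = min(PROCS, key=lambda q: finish_time(t, q, assign, finish, free))
        assign[t] = p
        finish[t] = free[p] = finish_time(t, p, assign, finish, free)
    return assign


def optimal_schedule():
    """Whole-DAG: try every assignment and keep the best makespan."""
    candidates = (
        dict(zip(ORDER, combo)) for combo in product(PROCS, repeat=len(ORDER))
    )
    return min(candidates, key=makespan)


print(makespan(greedy_schedule()), makespan(optimal_schedule()))
```

With these made-up numbers, the greedy pass places `a` on the CPU because that finishes `a` soonest, then pays a transfer to run `b` and `c` on the GPU (makespan 7). The exhaustive search keeps everything on the GPU (makespan 6). Enumeration grows exponentially with the number of tasks, which is why a real implementation would hand the problem to an optimizer instead.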

To make this scheduler work, some additional improvements were made:

  • A new library, MetricsTracker.jl, was implemented to make it easy to declaratively configure which metrics to collect during task scheduling and execution. It also provides mechanisms to efficiently search through collected metric values for those matching a given combination of target keys, such as the selected processor or the task signature. The scheduler uses this to look up information relevant to each task, such as estimated execution time and transfer costs.
  • Schedules generated by Datadeps are now cached and reused, when possible, within the same session. Submitted DAGs are compared for similarity, and if a match is found, the previously-generated schedule is reused. This allows potentially expensive scheduling operations to be amortized when Datadeps operations are being called repeatedly.
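
The two improvements above can be sketched roughly as follows. This is a Python illustration of the ideas only, not the MetricsTracker.jl or Datadeps API: `lookup`, `dag_signature`, and `ScheduleCache` are hypothetical names, and the sketch assumes "similarity" means an exact match on task signatures and dependency structure, which may be stricter than what the PR does.

```python
def lookup(samples, **keys):
    """Return metric values whose recorded keys match every given key."""
    return [
        s["value"]
        for s in samples
        if all(s["keys"].get(k) == v for k, v in keys.items())
    ]


def dag_signature(dag):
    """Hashable summary of a DAG: (task signature, dependency indices)."""
    return tuple((sig, tuple(sorted(deps))) for sig, deps in dag)


class ScheduleCache:
    """Reuse a previously computed schedule for a structurally equal DAG."""

    def __init__(self, scheduler):
        self.scheduler = scheduler  # potentially expensive callable
        self.cache = {}
        self.misses = 0

    def schedule(self, dag):
        key = dag_signature(dag)
        if key not in self.cache:
            self.misses += 1
            self.cache[key] = self.scheduler(dag)
        return self.cache[key]


# Hypothetical collected metrics, keyed by processor and task signature.
samples = [
    {"keys": {"processor": "cpu", "signature": "matmul"}, "value": 4.0},
    {"keys": {"processor": "gpu", "signature": "matmul"}, "value": 1.5},
    {"keys": {"processor": "gpu", "signature": "add"}, "value": 0.2},
]
gpu_matmul_times = lookup(samples, processor="gpu", signature="matmul")

# Stand-in scheduler that puts every task on the CPU.
cache = ScheduleCache(lambda dag: ["cpu"] * len(dag))
dag = [("matmul", ()), ("add", (0,))]
cache.schedule(dag)
cache.schedule(list(dag))  # same structure, so the cached schedule is reused
```

Because the cache key ignores the metric values themselves, a cached schedule can outlive the measurements it was based on.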

Todo:

  • Think about a solution for stale metrics when reusing schedules
  • Add tests
  • Add docs

jpsamaroo force-pushed the jps/datadeps-opt-sched branch from c85eed4 to 256057d on March 31, 2025 19:41