xds: generic lrs client for load reporting #8250

Open
wants to merge 19 commits into
base: master
from

Conversation

purnesh42H
Contributor

@purnesh42H purnesh42H commented Apr 15, 2025

This change adds a generic LRS client for reporting load to an LRS server.

The PR copies the existing

  • xds/internal/xdsclient/load/store.go,
  • xds/internal/xdsclient/transport/lrs/lrs_stream.go,
  • xds/internal/xdsclient/load/store_test.go, and
  • xds/internal/xdsclient/tests/loadreport_test.go

from the internal xdsclient code and then modifies them to use the generic client types and interfaces. Each "copy" commit is followed by the "modify" commit for that file. Reviewers can start with the "modify" commits.

PS: loadreport_test.go currently has compilation errors, so it is commented out; it depends on some of the functions added in #8183.

RELEASE NOTES: None

@purnesh42H purnesh42H added Type: Feature New features or improvements in behavior Area: xDS Includes everything xDS related, including LB policies used with xDS. labels Apr 15, 2025
@purnesh42H purnesh42H added this to the 1.73 Release milestone Apr 15, 2025
codecov bot commented Apr 15, 2025

Codecov Report

Attention: Patch coverage is 78.68132% with 97 lines in your changes missing coverage. Please review.

Project coverage is 82.29%. Comparing base (82e25c7) to head (a263158).
Report is 3 commits behind head on master.

Files with missing lines | Patch % | Lines
xds/internal/clients/lrsclient/lrs_stream.go | 66.85% | 43 Missing and 15 partials ⚠️
xds/internal/clients/lrsclient/lrsclient.go | 74.15% | 17 Missing and 6 partials ⚠️
xds/internal/clients/lrsclient/load_store.go | 91.44% | 12 Missing and 4 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #8250      +/-   ##
==========================================
+ Coverage   82.15%   82.29%   +0.14%     
==========================================
  Files         412      419       +7     
  Lines       40562    41787    +1225     
==========================================
+ Hits        33322    34390    +1068     
- Misses       5875     5949      +74     
- Partials     1365     1448      +83     
Files with missing lines | Coverage Δ
xds/internal/clients/lrsclient/logging.go | 100.00% <100.00%> (ø)
xds/internal/clients/lrsclient/load_store.go | 91.62% <91.44%> (+91.62%) ⬆️
xds/internal/clients/lrsclient/lrsclient.go | 74.15% <74.15%> (+74.15%) ⬆️
xds/internal/clients/lrsclient/lrs_stream.go | 66.85% <66.85%> (ø)

... and 30 files with indirect coverage changes


@purnesh42H purnesh42H force-pushed the generic-xds-client-lrs-client-e2e branch from 78ab34a to ce9ba3d Compare April 19, 2025 18:36
@purnesh42H purnesh42H requested a review from dfawley April 21, 2025 05:39
Contributor Author

purnesh42H commented Apr 21, 2025

@dfawley assigning this for review since #8183 is close now. loadreport_test.go is commented out because it depends on testing helpers introduced in 8183, but I have tested it in my fork. Will add Easwar once he finishes 8183.

@purnesh42H purnesh42H force-pushed the generic-xds-client-lrs-client-e2e branch 2 times, most recently from e140792 to 2e2674f Compare April 21, 2025 18:22
Member

@dfawley dfawley left a comment

Looks fine overall. Just a few comments inline.

- func (ls *LoadStore) ReporterForCluster(clusterName, serviceName string) PerClusterReporter {
- 	panic("unimplemented")
+ func (ls *LoadStore) ReporterForCluster(clusterName, serviceName string) *PerClusterReporter {
+ 	if ls == nil {
Member

I think it's fine to panic if a nil LoadStore is used. Why not? It seems like a pretty severe programming error.

Contributor Author

@purnesh42H purnesh42H Apr 23, 2025

Removed the nil check. @easwars, is there any reason why this check is there in the existing code?

Contributor

Probably for tests to use a nil load store. If that is not required anymore and tests are happy, we should be good to remove the nil check.

}

  // CallStarted records a call started in the LoadStore.
  func (p *PerClusterReporter) CallStarted(locality string) {
- 	panic("unimplemented")
+ 	if p == nil {
Member

As above. And below.

Contributor Author

Done

}

func (rcd *rpcCountData) decrInProgress() {
atomic.AddUint64(rcd.inProgress, negativeOneUInt64) // atomic.Add(x, -1)
Member

Nit: the const doesn't seem to buy us anything, since we're already needing to comment what this means. IMO delete the constant and inline it.

Contributor Author

Made it inline
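
For readers unfamiliar with the idiom being inlined here: the sync/atomic documentation describes decrementing an unsigned counter by adding `^uint64(0)`. The following is a minimal sketch of that idiom only; the `counter` type and its methods are hypothetical stand-ins and are not the PR's `rpcCountData`.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// counter is a simplified, hypothetical stand-in for the PR's rpcCountData.
type counter struct {
	inProgress uint64
}

// incrInProgress atomically adds one to the in-progress count.
func (c *counter) incrInProgress() { atomic.AddUint64(&c.inProgress, 1) }

// decrInProgress atomically subtracts one. For unsigned integers, adding
// ^uint64(0) (all bits set) wraps around and is equivalent to adding -1.
func (c *counter) decrInProgress() { atomic.AddUint64(&c.inProgress, ^uint64(0)) }

// load atomically reads the current in-progress count.
func (c *counter) load() uint64 { return atomic.LoadUint64(&c.inProgress) }

func main() {
	c := &counter{}
	c.incrInProgress()
	c.incrInProgress()
	c.decrInProgress()
	fmt.Println(c.load())
}
```

With the expression inlined, the trailing `// atomic.Add(x, -1)` style comment still carries the explanation, which is why the named constant adds little.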

Comment on lines 457 to 460
s = rld.sum
rld.sum = 0
c = rld.count
rld.count = 0
Member

You could do something like this, which might(?) be more quickly understood:

Suggested change
- s = rld.sum
- rld.sum = 0
- c = rld.count
- rld.count = 0
+ s, rld.sum = rld.sum, 0
+ c, rld.count = rld.count, 0

Or,

Suggested change
- s = rld.sum
- rld.sum = 0
- c = rld.count
- rld.count = 0
+ s, c = rld.sum, rld.count
+ rld.sum, rld.count = 0, 0

Contributor Author

Did the first one

c = rld.count
rld.count = 0
rld.mu.Unlock()
return
Member

Please don't use bare returns that return values. It can be hard to understand what's going on, mainly for longer functions.

Suggested change
- return
+ return s, c

Or pair with the second option above:

func (rld *rpcLoadData) loadAndClear() (float64, int64) {
	rld.mu.Lock()
	defer rld.mu.Unlock()

	s, c := rld.sum, rld.count
	rld.sum, rld.count = 0, 0
	return s, c
}

Contributor Author

Yes, added them to the return statement.

return c, err
}

/*
Member

Why is this commented out?

Contributor Author

I mentioned this in the PR description. The tests had compilation errors because they depend on some things added in PR 8183. Now that it's merged, I have rebased on latest and uncommented it.

@dfawley dfawley assigned purnesh42H and unassigned dfawley Apr 21, 2025
Member

dfawley commented Apr 21, 2025

I mostly skimmed the changes - @easwars may also want to take a quick pass.

The commits here aren't quite as easy to review as the last change, since they go file-by-file. It would have been easier if one commit copied all the files, so that we could just skip that one commit when reviewing.

@easwars easwars self-assigned this Apr 22, 2025
@purnesh42H purnesh42H force-pushed the generic-xds-client-lrs-client-e2e branch from 0a8d352 to aaa667d Compare April 23, 2025 16:22
@purnesh42H purnesh42H removed their assignment Apr 23, 2025
Comment on lines 47 to 50
// Note that new entries are added to this map, but never removed. This is
// potentially a memory leak. But the memory is allocated for each new
// (cluster,service) pair, and the memory allocated is just pointers and
// maps. So this shouldn't get too bad.
Contributor

Could you please add this to the list of things that are being tracked? Thanks.

Contributor Author

Yes, I had already added it. Initially, I was thinking of changing it in the internal code first, but I think it's fine to fix it here after the migration.

clusters map[string]map[string]*PerClusterReporter
}

// newStore creates a LoadStore.
Contributor

This comment needs to be updated to match the function name.

Contributor Author

Done

Comment on lines 72 to 75
// Wait for the provided context to be done (timeout or cancellation).
if ctx != nil {
<-ctx.Done()
}
Contributor

How and why would a user use this feature?

Contributor Author

The idea is that if it's the last reference to the underlying stream, Stop should wait for the context to be done, which allows any pending load to be flushed during the wait. If not, then just go ahead and close the stream.

Contributor Author

I updated the logic to wait only if it's the last reference, just before closing the stream.
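
The behavior described in this thread can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's code: the `stream` type, its fields, and the `demo` helper are hypothetical, and the actual flush of pending load is omitted.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// stream is a hypothetical stand-in for the PR's streamImpl.
type stream struct {
	mu     sync.Mutex
	refs   int
	closed bool
}

// stop drops one reference. Only the call that drops the last reference
// waits for ctx to be done before marking the stream closed; the wait is
// the window in which pending load could still be reported.
func (s *stream) stop(ctx context.Context) {
	s.mu.Lock()
	s.refs--
	last := s.refs == 0
	s.mu.Unlock()
	if !last {
		return
	}
	<-ctx.Done()
	s.mu.Lock()
	s.closed = true
	s.mu.Unlock()
}

// isClosed reports whether the stream has been closed.
func (s *stream) isClosed() bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.closed
}

// demo returns whether the stream was closed after the first stop call
// and after the second stop call, starting from two references.
func demo() (bool, bool) {
	s := &stream{refs: 2}
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Millisecond)
	defer cancel()
	s.stop(ctx) // not the last reference: returns immediately
	first := s.isClosed()
	s.stop(ctx) // last reference: blocks until ctx is done
	return first, s.isClosed()
}

func main() {
	first, second := demo()
	fmt.Println(first, second)
}
```

A real implementation would also have to decide what happens if a new reference is acquired while the last stopper is still waiting; the sketch does not handle that case.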

// If a cluster's loadData is empty (no load to report), it's not appended to
// the returned slice.
func (ls *LoadStore) stats(clusterNames []string) []*loadData {
var ret []*loadData
Contributor

Nit: Can we please move this variable declaration to be on the line above the if len(clusterNames) == 0 { ... }. The code just reads better when it starts with a call to lock and a deferred call to unlock and everything else follows after that.

Contributor Author

Done

Comment on lines 317 to 318
// appendClusterStats gets Data for the given cluster, append to ret, and return
// the new slice.
Contributor

This comment needs some updating since there is no single "given cluster", but a collection of clusters.

Contributor Author

Done

wg.Wait()

gotStoreData := ls.stats()
if diff := cmp.Diff(wantStoreData, gotStoreData, cmpopts.EquateEmpty(), cmp.AllowUnexported(loadData{}), cmpopts.IgnoreFields(loadData{}, "reportInterval")); diff != "" {
Contributor

This diff command is repeated multiple times with the same (or maybe almost same) set of options. Maybe a helper function can be written to do the comparison and the test can simply call if err := verifyStoreData(got, want); err != nil { ... }?

Contributor Author

Moved to helper
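
A helper of the shape suggested above might look like the sketch below. To keep it self-contained it uses `reflect.DeepEqual` from the standard library rather than the `cmp.Diff` call with options that the PR's tests use, so it is not the helper that was actually added. The pared-down `loadData` struct and the `sample` constructor are hypothetical.

```go
package main

import (
	"fmt"
	"reflect"
	"time"
)

// loadData is a pared-down, hypothetical version of the PR's loadData.
type loadData struct {
	cluster        string
	totalDrops     uint64
	reportInterval time.Duration
}

// verifyStoreData compares got and want entry by entry while ignoring
// reportInterval, which varies from run to run. Nil and empty slices are
// treated as equal because only lengths and elements are compared.
func verifyStoreData(got, want []*loadData) error {
	if len(got) != len(want) {
		return fmt.Errorf("got %d entries, want %d", len(got), len(want))
	}
	for i := range got {
		g, w := *got[i], *want[i] // copies, so the inputs are not modified
		g.reportInterval, w.reportInterval = 0, 0
		if !reflect.DeepEqual(g, w) {
			return fmt.Errorf("entry %d: got %+v, want %+v", i, g, w)
		}
	}
	return nil
}

// sample builds a one-entry slice for illustration.
func sample(drops uint64, intervalNanos int64) []*loadData {
	return []*loadData{{
		cluster:        "c1",
		totalDrops:     drops,
		reportInterval: time.Duration(intervalNanos),
	}}
}

func main() {
	fmt.Println(verifyStoreData(sample(1, 5), sample(1, 9)))
	fmt.Println(verifyStoreData(sample(1, 5), sample(2, 5)))
}
```

Each test then reduces to `if err := verifyStoreData(got, want); err != nil { t.Error(err) }`, which keeps the comparison options in one place.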

}
var resp v3lrspb.LoadStatsResponse
if err := proto.Unmarshal(r, &resp); err != nil {
lrs.logger.Infof("Failed to unmarshal response to LoadStatsResponse: %v", err)
Contributor

There is a warning log for the marshal error, but an info log for the unmarshal error. Please use a consistent log level, and if you end up choosing info, please add a verbosity check too.

Contributor Author

Made both Infof guarded by verbosity check
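
The pattern of guarding an info log with a verbosity check can be sketched as below. The `verboseLogger` interface and `memLogger` type are hypothetical stand-ins defined only for this example, and the verbosity level 2 is an assumption; the PR's own logger type and chosen level are not shown in this thread.

```go
package main

import (
	"errors"
	"fmt"
)

// verboseLogger is a hypothetical minimal interface for a logger that
// exposes a verbosity check alongside Infof.
type verboseLogger interface {
	V(level int) bool
	Infof(format string, args ...any)
}

// memLogger records log lines in memory so the effect is observable.
type memLogger struct {
	verbosity int
	lines     []string
}

// V reports whether messages at the given level should be logged.
func (l *memLogger) V(level int) bool { return level <= l.verbosity }

// Infof formats and stores a log line.
func (l *memLogger) Infof(format string, args ...any) {
	l.lines = append(l.lines, fmt.Sprintf(format, args...))
}

// logUnmarshalFailure emits the info log only when verbosity level 2 is
// enabled, so the formatting cost is skipped otherwise.
func logUnmarshalFailure(l verboseLogger, err error) {
	if l.V(2) {
		l.Infof("Failed to unmarshal response to LoadStatsResponse: %v", err)
	}
}

// countLogged returns how many lines are logged at the given verbosity.
func countLogged(verbosity int) int {
	l := &memLogger{verbosity: verbosity}
	logUnmarshalFailure(l, errors.New("bad bytes"))
	return len(l.lines)
}

func main() {
	fmt.Println(countLogged(0), countLogged(2))
}
```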

Comment on lines +54 to +56
// The LRSClient owns a bunch of streams to individual LRS servers.
//
// Once all references to a stream are dropped, the stream is closed.
Contributor

Once all references to a channel are dropped

What channel are we referring to here?

Contributor Author

The streams that are running for each server. Are you suggesting that we refer to them as channels?

Comment on lines 159 to 164
if lrs.cancelStream == nil {
// It is possible that Stop() is called before the cleanup function
// is called, thereby setting cancelStream to nil. Hence we need a
// nil check here before invoking the cancel function.
return
}
Contributor

I don't quite follow this. The only place that sets lrs.cancelStream to nil is right below in this same closure.

Contributor Author

I removed it. Maybe it had a reason in the internal code because it was within the xDS client? Here it's not possible.

@easwars easwars assigned purnesh42H and unassigned easwars Apr 24, 2025
func New(config Config) (*LRSClient, error) {
switch {
case config.Node.ID == "":
return nil, errors.New("lrsclient: node ID is empty")
Contributor

Nit: the error message says "node ID", but this is the complete node configuration. So, please update the error message to reflect that.

Contributor Author

@purnesh42H purnesh42H Apr 25, 2025

The case is checking for the node ID. We shouldn't have a node without an ID. Updated the message to mention that.

@@ -30,6 +33,28 @@ import "context"
//
// It is safe for concurrent use.
type LoadStore struct {
lrsStream *streamImpl
Contributor

The only reason we need the streamImpl here is to call stop on it. Why can't we simply have a function pointer here instead?

Contributor Author

@purnesh42H purnesh42H Apr 25, 2025

Yeah, I moved the function to LoadStore instead and set it on the load store directly from the LRS client. Also, renamed it to Stop.

Comment on lines 136 to 140
ctx, cancel := context.WithCancel(context.Background())
lrs.cancelStream = cancel
lrs.doneCh = make(chan struct{})
lrs.loadStore = newLoadStore(lrs)
go lrs.runner(ctx)
Contributor

Can all of this be done in newStreamImpl?

Contributor Author

Done

delete(c.lrsStreams, serverIdentifier)
tr.Close()
}
lrs.cleanup = cleanup
Contributor

This is how I see the API being used:

  • user creates a new LRS client with lrsclient.New()
  • when they want to report load to a server, they would call ReportLoad on the returned LRS client from above
    • this would create a new streamImpl if required or reuse an existing one
    • the user is returned a reference to the associated LoadStore
  • going forward, all user interactions are with the LoadStore and the PerClusterReporter returned from the store
  • Eventually, they would call Stop() on the store to indicate that they no longer wish to report loads
    • this calls streamImpl.stop(), which calls this cleanup function
    • we need a once func somewhere in this call path to ensure that multiple calls to LoadStore.Stop() does not decrement the ref count multiple times

Contributor Author

@purnesh42H purnesh42H Apr 25, 2025

yeah that's the flow.

we need a once func somewhere in this call path to ensure that multiple calls to LoadStore.Stop() does not decrement the ref count multiple times

A once func is not possible because the LoadStore is shared? The stop function does have a check, though: if the LRS reference count is already 0 when stop is called, it returns early and logs an error.
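
One possible way to reconcile a shared, ref-counted stream with the "release at most once" requirement is to hand each caller its own small handle that wraps the release in a `sync.Once`. The sketch below only illustrates that idea; `sharedStream` and `handle` are hypothetical types, and the PR as discussed here returns a shared LoadStore rather than per-caller handles.

```go
package main

import (
	"fmt"
	"sync"
)

// sharedStream is a hypothetical ref-counted stream shared by all callers
// reporting load to the same server.
type sharedStream struct {
	mu   sync.Mutex
	refs int
}

// acquire takes a reference and returns a per-caller handle for it.
func (s *sharedStream) acquire() *handle {
	s.mu.Lock()
	s.refs++
	s.mu.Unlock()
	return &handle{s: s}
}

// release drops one reference.
func (s *sharedStream) release() {
	s.mu.Lock()
	s.refs--
	s.mu.Unlock()
}

// refCount returns the current number of references.
func (s *sharedStream) refCount() int {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.refs
}

// handle is a hypothetical per-caller wrapper. Its Stop releases the
// caller's reference at most once, however many times it is called.
type handle struct {
	s    *sharedStream
	once sync.Once
}

// Stop releases this handle's reference on the first call only.
func (h *handle) Stop() { h.once.Do(h.s.release) }

func main() {
	s := &sharedStream{}
	a, b := s.acquire(), s.acquire()
	a.Stop()
	a.Stop() // no-op: a's reference was already released
	fmt.Println(s.refCount())
	b.Stop()
	fmt.Println(s.refCount())
}
```

The early-return check on a zero reference count described above guards against underflow, whereas a per-handle once also prevents one caller's repeated Stop from consuming another caller's reference.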

@@ -43,38 +68,372 @@ type LoadStore struct {
// attempt to flush any unreported load data to the LRS server. It will either
// wait for this attempt to complete, or for the provided context to be done
// before canceling the LRS stream.
- func (ls *LoadStore) Stop(ctx context.Context) error {
- 	panic("unimplemented")
+ func (ls *LoadStore) Stop(ctx context.Context) {
Contributor

I think we can remove this paragraph about reference counting from this docstring.

Also, where do we make the last attempt to flush any unsent loads?

Contributor Author

I modified it to just call out the last reference to the stream: only if this is the last reference to the stream do we wait for the context to be done.

Also, where do we make the last attempt to flush any unsent loads?

I mentioned this in the other comment. Basically, we wait for the context to be done to allow some time for load to be flushed based on the load reporting interval, because I think we don't want to make multiple attempts in less than the reporting interval? WDYT?

- func (*LRSClient) ReportLoad(_ clients.ServerIdentifier) *LoadStore {
- 	panic("unimplemented")
+ // ReportLoad creates and returns a LoadStore for the caller to report loads
+ // using a LoadReportingStream.
Contributor

I think we should mention here that the caller should call Stop on the returned LoadStore when they are done reporting load to this server.

Contributor Author

Done

@purnesh42H purnesh42H force-pushed the generic-xds-client-lrs-client-e2e branch from 9b88d68 to a263158 Compare April 25, 2025 06:15
@purnesh42H purnesh42H requested a review from easwars April 25, 2025 06:17
@purnesh42H purnesh42H assigned easwars and unassigned purnesh42H Apr 25, 2025
Labels
Area: xDS Includes everything xDS related, including LB policies used with xDS. Type: Feature New features or improvements in behavior

3 participants