Skip to content

Update to rebuild timestamps when no RTC is available at boot #1

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 11 commits into
base: master
Choose a base branch
from

Conversation

deldrid1
Copy link
Owner

@deldrid1 deldrid1 commented Oct 4, 2016

This is accomplished by logging hardware.millis() to the spiflashlogger and then syncing hardware.millis() with time(), once our cloud connection is made and time() is restored.

A few open questions/comments:

  1. This requires some minor modifications to the Bullwinkle Library
  2. There is potentially a race condition that could lead to sending an invalid time to the agent... If server.isconnected() changes between line 58 of the impPager code and _bull._sendMessage line 169, we will end up sending _bootnumber + "-" + hardware.millis() instead of time().
    • Can multiple calls to server.isconnected() return different results if Squirrel code never yields to impOS?
    • If so, we will have to make this code a bit more ugly to handle the scenario...
  3. Should we wrap / update bullwinkle to allow for multiple onSuccess/onFail handlers to be registered or just trust impPager to take care of business? This would allow us to return _bullwinkle.send() and allow additional actions outside of impPager to take place based on success/failure/ack.
  4. Is the check on line 125 really needed? I'm not sure if there a race condition where we come online (_connectionManager.isConnected() == true) but we don't yet have a clock?
  5. We should probably be doing some rounding where we calculate our delta times - right now we are relying on integer math, which will always round down (even if the calculated timestamp delta end in 999ms)
    • ~~Will we get some odd behavior if we are sending 1 second data that happens to fall close to the 500ms hardware.millis() mark? There's only a ~10/1000 chance that it happens, but I'm not sure how the long term performance of imp.wakeup(1.0) behaves... (does it tend to be precise, tend to be slow, tend to be fast, etc. - it may vary from device to device based on crystal accuracy)~~
      • Haven't seen this happen and we did get some 506ms deltaT's so I am assuming it is a non-issue
  6. Should we go ahead and [erase the spiflashlogger entry and add a new one with the RTC value corrected](

    ImpPager/ImpPager.class.nut

    Lines 156 to 157 in fafe9a2

    ) once we calculate it?
    • This would help give us improve our odds at getting a clock onto our messages (especially in cases of a very flaky internet/power situation) at the potential expense of some SPIFlash wear
    • Thinking through this a bit more, this is probably required because our math won't likely work out if a datapoint without an RTC has to have its RTC calculated a second time...
    • We will probably need to pass around the reference to the next() callback function from the spiflashlogger.read() onData callback and deal with all of the edge cases to ensure we are careful about when we schedule our next tries. Right now the onFinish is not being used since we are managing the scanning outside of the way SPIFlashLogger was really designed. If we are talking about erasing/writing data while we are reading, I assume that our current strategy will cause things to not work very well.
  7. We are writing ID's to SPIFlash but then we update them if we have a Bullwinkle packages collision - is this the right thing to do?

…ne and has no RTC.

This is accomplished by logging hardware.millis() to the spiflashlogger and then syncing hardware.millis() with time(), once our cloud connection is made and time() is restored
…cond.

Provides slightly more accurate timestamps.
Maximizes retry throughput, while ensuring that only 1 message is in flight at a time (to protect against out of memory crashes)

if("boot" in dataPoint.metadata && dataPoint.metadata.boot == _bootNumber){
local deltaTMillis = _lastTS[0] - dataPoint.ts
local deltaTSeconds = math.floor(deltaTMillis+500).tointeger()/1000 //Round to nearest second, but use this int value for updating _lastTS to keep things consistent
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

math.floor(deltaTMillis+500) == deltaTMillis+500

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May be you meant
math.floor((deltaTMillis+500)/1000).tointeger()
?

Copy link
Owner Author

@deldrid1 deldrid1 Oct 7, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

duh - yes... Fixed in 76b3988

_log_debug("Calculated new ts as " + dataPoint.ts + " (deltaT = " + deltaTMillis + " ms)")

//update _lastTS so that we can have 25 days between datapoints instead of 25 days total of timestamps that we can rebuild
_lastTS[0] -= (deltaTSeconds*1000)
Copy link

@ppetrosh ppetrosh Oct 7, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_lastTS[0] -= deltaTMillis�
?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did it the way that I did so that I would be careful to keep _lastTS[0] and _lastTS[1] consistent and not have any kind of floating point precision issues (a .tointeger() here would probably make things a bit clearer). _lastTS[1] is only precise down to the nearest second, and I don't want to deal with rounding problems, precision issues with floats, etc. here. It does mean that you will have slightly less than the full range of hardware.millis() to work with, but what's half a second when you have a 25 day window?

There may be some edge cases to think through when you have many messages within the same second that this doesn't handle very gracefully? I have another class that pushes data on a 1 second interval "in bulk" to help throttle/prevent fast occurring events from overwhelming the system, so I haven't stress tested this too much.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guessed that. Thanks for the explanation!

_resendLoggedData(dataPoint);

// Don't do any further scanning until we get an ACK for already sent message
_next = next
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To reduce the risk of a race condition we probably need to have this statement at the beginning of the function.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the current implementation, I think its a non-issue (no async code is going to jump in the middle of single-threaded squirrel), but agree - its an unnecessary risk that we can mitigate by moving this line of code.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 127c400

if("boot" in dataPoint.metadata && dataPoint.metadata.boot == _bootNumber){
local deltaTMillis = _lastTS[0] - dataPoint.ts
local deltaTSeconds = math.floor(deltaTMillis+500).tointeger()/1000 //Round to nearest second, but use this int value for updating _lastTS to keep things consistent
dataPoint.ts = _lastTS[1] - deltaTSeconds //All integer math, so no need to worry about decimal points
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And as I mentioned in the Bullwinkle PR, I'm not fully convinced yet that we should modify .ts from outside the Bullwinkle.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll discuss this afternoon...

@deldrid1
Copy link
Owner Author

deldrid1 commented Oct 7, 2016

@ppetrosh - any thoughts on the questions I posed at the beginning of the Pull Request? Creating real scenarios to test some of these conditions is pretty challenging without having access to set breakpoints in squirrel/impOS...

…tions.

Not an issue with the current implementation, but could come back to bite someone in the future.
Also, add some comments and an explicit .tointeger() cast to make our strategy for updating _lastTS a bit clearer
This message can be used by the agent to understand that it has received all of the impPager data from SPIFlash.  (You can detect the start of data from SPIFlash by checking ```if("metadata" in message && "addr" in message.metadata)```.)  In a case where you need to read data from SPIFlash from newest->oldest to rebuild timestamps, but you need to process messages from oldest->newest for whatever reason, this message will let you know that things are complete.
…a timestamp to it.

This was flawed logic... Because the order that we read and write are opposite, the risk  of overwriting precious data is greater than the risk of not associating a proper timestamp with the data.  There may be external systems that can rebuild the proper timestamps (since hardware.millis() starts at 0 at power-on time, if you know when your outages are, you can rebuild the timestamps external to the imp cloud)
@ppetrosh
Copy link

ppetrosh commented Oct 8, 2016

@deldrid1, agree, source-level debugging seems to be gaining the priority. We'll discuss this requirement internally. Though I don't think this is something we can commit to in a reasonably near future.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants