Hello Dialog,
We are seeing some odd behavior with some of our systems containing DA14581 chips. In our system, the DA14581 is the slave device and different vendors are supplying the master BLE chips.
On a small minority of such systems, we observe random pauses in empty PDU transmissions from the DA14581. A BLE packet sniffer trace is attached. The random dropouts are visible in the trace as the drops in throughput. Each pause is always approximately 1.1 - 1.2 seconds in duration. There is no application layer traffic being exchanged at the time of the trace - these are all idle packets.
Our systems should be built and configured identically. The application running at the time of the trace was downloaded via the IDE (NOT running out of OTP) This behavior is only exhibited in a small minority of such systems. Slave latency is set to 0 for all (confirmed multiple times!).
Any thoughts as to what might be behind this behavior? It is causing connections to drop in these cases.
Thanks,
Jim

Hi JamesHiebert,
So, this behaviour is exhibited randomly over devices that share the same build and fw ? Is this right ? The devices that exhibit this behaviour are unique, i mean, have you performed this test on similar enviroment on multiple devices and only one (for example) device was exposing this behaviour in successive tests ? Also what about the masters connected ? Is this behaviour correlated with a specific master or this will occur regardless the master of the link ?
Do you have the actual sniffer trace to share when the incident occurs and have you verified that this side effect is due to missed events from the side of the 581, so on the sniffer log you see packets from the master side and no packet from the slave side ? You have attached a graph from the throughput, but this doens't help much to find a root cause for this, what would also help is a current capture when the issue occurs, and actually see that the 581 misses connection events through the power consumption. Also it would help mentioning the SDK that you are using and the configuration of the device, is the application running entirely on the 581 or you are using the HCI interface, i assume that the app is running on the 581 entirely. And what about your low power clock, are you using the RCX or the XTAL ?
Thanks MT_dialog
Hi Dialog,
So, this behaviour is exhibited randomly over devices that share the same build and fw ? Is this right ? [Yes. A small number (few percent) of devices that have the EXACT same configuration of HW and FW exhibit the dropout behavior.]
The devices that exhibit this behaviour are unique, i mean, have you performed this test on similar enviroment on multiple devices and only one (for example) device was exposing this behaviour in successive tests ? [Correct. In one particular environment, there is a lab with a dozen or so identical test stations, all with a DUT with a slave 581 chip, and only one DA14581 exhibited the behavior. We swapped the circuit board with a different board (different 581) and the problem went away. It has also happened in other environments. The behavior follows the 581 (or at least the circuit board on which it is soldered).]
Also what about the masters connected ? Is this behaviour correlated with a specific master or this will occur regardless the master of the link ? [The vast majority of our master devices are identical custom built boxes. There are a few outliers - commercial smart phones. I'll run an experiment with a known "failing" DA14581 and a smartphone.]
Do you have the actual sniffer trace to share when the incident occurs? [I attached a trace with 2 dropouts. We use a Teledyne LeCroy ComProbe Protocol Analysis System. Can you read those files?]
...and have you verified that this side effect is due to missed events from the side of the 581, so on the sniffer log you see packets from the master side and no packet from the slave side ? [This is also correct. The master continues to transmit idle packets with no response from the DA14581. These intervals occur somewhat randomly but always last for approximately 1.1 - 1.2 seconds.]
You have attached a graph from the throughput, but this doens't help much to find a root cause for this, what would also help is a current capture when the issue occurs, and actually see that the 581 misses connection events through the power consumption. [This will be tricky with the configuration of our DUT...]
Also it would help mentioning the SDK that you are using and the configuration of the device, is the application running entirely on the 581 or you are using the HCI interface, i assume that the app is running on the 581 entirely. [The FW running on the 581 is based on the 5.0.4 SDK, but it has been heavily modified. Most of the arch*.c files are untouched however. HCI is not used. In our system there is an app running on the 581 as well as another app running on a different IC. The two chips communicate via SPI with a custom protocol, not HCI.]
And what about your low power clock, are you using the RCX or the XTAL ? [RCX]
Some additional info. The pauses seem to disappear with longer Connection Intervals. For instance, no more than 1 slave idle packet in a row was dropped in over 7 hours with a 47.5ms CI. I am running further tests with shorter intervals.
My observations in the previous message had a CI of 11.25ms programmed.
Hi JamesHiebert,
The fact that you are able to re-establish communication after more than 1 sec of inactivity of the slave means that you should be ok on the clocks side and there is no reset while the issue occurs. I dont see anything peculiar from the master side, a previous transaction that might went wrong and affected the slave for example, the only things i can think of that might result in this condition are:
As a result the device misses events. A current trace would help use identify that, since if the device is occupied via running code we should see a flat line just consuming and if the startup FSM took extra time we should see wakes ups and no radio activity at all (since if the device had woken up to late for servicing the event it wont start a radio transaction).
What would also help, would be if you try to run a simple example and check if you can replicate this on your hw, for example the proximity reporter from the SDK. Also, does the device has any activity while the incident occurs, you mention that its connected to an external device and communicates via SPI, so is there any action on the bus during that period or perhaps a pattern that can relate the SPI transaction with the issue? Perhaps there is some kind of correlation between the issue and the communication of the 2 devices.
Thanks MT_dialog
Hi Dialog,
Sorry for the long delay. We've been collecting some data on the issue which may help you help us.
Here's a summary of what we know:
•停顿只发生在设备(5%的一个子集)? All devices are running the same application SW.
• When the pauses do occur:
○ They occur randomly
○ The duration of the pause depends on the master's SCA (see Dialog Drop-outs by CI.jpg in attached).
○ The pauses only occur if the Connection Interval is a multiple of 11.25ms (22.5ms, 45ms,...). If other CI's are negotiated, no drop-outs (see Dialog Drop-outs by CI.jpg in attached zip).
○ A current trace shows consistent timing but no current spikes (representing RX and TX activity) when the pauses occur (see ComparePause_NoPause.png in attached zip). The upper trace shows CE's without pauses (two tall spikes representing idle TX and RX activity are seen), and the lower shows CE's with pauses. The timing looks consistent.
It could be something our application SW is doing (or not doing), but it isn't obvious to us. Thanks for any help you can provide!
“醒来序列more than expected (clock settling or code has been added in wake-up sequence) and the device reaches the programming of the event too late (there is an assertion in the SLP handler in the power_up() function, is that assertion still exists on your code ?)"
Regarding your question above about the assertion, is this the code in question? I modified the ASSERT_WARNING to ASSERT_ERROR to cause a very visible reset should this condition occur.
When pauses are occurring, the ASSERT_ERROR below does NOT occur.
/*
* Check if BLE_SLP_IRQ has already asserted. In this case, we delayed in periph_init().
* Increase LP_ISR_TIME_XTAL32_CYCLES and LP_ISR_TIME_USEC values to give more execution time
* to periph_init().
*/
如果(GetBits32 (BLE_INTSTAT_REG SLPINTSTAT))
ASSERT_ERROR(0);
Hi jamesHiebert,
From the traces that you have attached it seems that the device properly wakes up, but apparently the time for the Rx/Tx event is wrong, most probably delayed, this is the only explanation that i can come up with for not having RF activity on certain wake ups. Although if this time was wrong then the assertion that i ve mentioned should hit, also can you please check if the second assertion occurs in the lld_sleep_compensate_func_patched(). Also is there any activity on the SPI device that communicates with the 581 while the issue occurs, perhaps what you are experiencing is somekind of a race condition, since i am not able to justify why when using multiples of the 11.25 connection interval the device will behave like that.
Thanks MT_dialog
Hello again Dialog,
We are still looking into this issue. Some new learning, perhaps this will help you help us?
The pauses do not occur with sleep disabled.
No asserts are seen when the dropouts happen, including the ones in lld_sleep_compensate_func_patched()
I modified the existing rwble.c "last event" logging to record the events to a large circular buffer, then read the buffer when a drop out occurred.
当一切都好,我看到BLE_EVT_LP, BLE_EVT_SLP BLE_EVT_CSCNT, BLE_EVT_RX, BLE_EVT_TX, BLE_EVT_END repeating.
When drop outs occur, I see only BLE_EVT_LP, BLE_EVT_SLP, BLE_EVT_CSCNT repeating.
I will check for SPI activity, but there shouldn't be any....
Also please note that setting USE_POWER_OPTIMIZATIONS to 0 with extended sleep enabled seems to eliminate the pauses/dropouts. We do not use deep sleep.
Hi JamesHierbert,
The fact that you are seeing only the BLE_EVT_LP, BLE_EVT_SLP, BLE_EVT_CSCNT when the issue occurs just confirms the fact that the device wakes up properly but with no BLE
activity, as far as i am aware and as mentioned in previous posts, the only reason i can think of is if there are delays before the RX/TX activities that will prevent the device on waking up on time. Since you reported that you are not able to see any ASSERTIONS from the allready debugging structures that the SDK has, then there must be a delay between the BLE_EVT_SLP and the BLE_EVT_CSCNT. From the captures that you have provide i am not able to track the CSCNT event unless is the CSCNT part is executed in the last bump i see in the problematic trace (which in that case loosing the event is justified). It seems from the trace that time between the SLP event and the CSCNT is significantly larger than the proper trace. So right after the SLP then the main while function of the SDK has the opportunity to run, are you certain that you dont close the interrupts under some condition in your application code ? Can you please take some time measurements following the instructions below:
In the BLE_CSCNT_Handler() try to log the following values when the issue occurs, evt->time (time that the next event should be scheduled) along with the current time count, to do that you can use the below snippet:
For testing purposes i ve used two 16 bytes length arrays.
struct lld_evt_tag *evt = (struct lld_evt_tag *)co_list_pick(&lld_evt_env.evt_prog);
time1_log[time_idx]=evt->time;
time2_log[time_idx]= lld_evt_time_get();
time_idx++;
time_idx&=0xF;
Normally the values collected from time1_log should be at least 2 slots in the future than the current time value (lld_evt_time_get() time2_log). If not then that means that something (most probably from application level) is stalling the programming of the BLE event, and apparently the BLE event never occurs since its delayed.
Thanks MT_dialog
Hello Dialog,
Sorry for slow responses. I am working on collecting the data you've requested, but its a bit of a challenge because of the lag between the actual pause and when I notice it on the sniffer. I need sufficient buffer space to record the timing data back that far and memory is running low! I will continue to work on this.
In the mean time, I've discovered the following: if I disable SPI during a connection the dropouts cease.
As you know, every time BLE wakes up from extended sleep the peripherals must be reinitialized. In our system there is an app running on the 581 as well as another app running on a different IC. The two chips communicate via SPI with a custom protocol. I put logic in our 581 app to stop re-enabling SPI after extended sleeping. Specifically when the following line of code is no longer run, I see that the pauses immediately stop.
SetBits16(SPI_CTRL_REG,SPI_ON, 1);
I have also tried masking the SPI interrupt and stopping other GPIO interrupts between the BLE and other chip in the system instead (see below) but in these cases the pauses continue. Above is the only peripheral re-initialization step that skipping seems to stop the pauses.
• Not related to enabling the SPI interrupt
• Not related to signaling the other chip via GPIO.
• Not related to SPI internal clock divider setting (CLK_PER_REG changed from /2 to /4)
• Not related to SPI interrupt priority (increased from 20 to 3)
• Not related to CLK_PER_REG enable
• Not related to configuring the GPIO pins (input/output, function, etc.) for SPI
Hello Dialog,
Please find attached a write up with the timing data collected as suggested in your Thu, 2018-07-12 10:36 post.
Hi Jim,
I will be dealing with you offline on this issue. For others following this issue we will makesure to log our results.
/MHv