Problem Solving

Below are some select examples of my problem-solving skills:

2.4GHz Transceiver and Stack Bugs

I provided extensive firmware consulting for a client who wanted their gateway product to communicate with other devices using a proprietary 2.4GHz transceiver and stack. Having significant experience with Texas Instruments’ 2.4GHz transceivers and stacks, I quickly noticed what appeared to be an abnormally high percentage of packet loss. The types of packet loss I was witnessing also hinted at multiple root causes.

To allow other threads of the client’s hardware, firmware, and software projects to move forward, I provided a temporary firmware workaround. After some critical path tasks had been completed, I began to investigate the communication errors since, even with the workaround, they were becoming quite problematic. I set up controlled tests using two of the silicon vendor’s development boards and made small modifications to the demo firmware that allowed me to isolate and detect communication errors.
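As an illustration of how a controlled two-board test can separate different kinds of communication errors, here is a minimal Python sketch. It assumes the transmitter stamps every packet with an incrementing sequence number and the receiver logs the numbers it sees; that scheme and the name classify_packets are assumptions for illustration, not the actual test firmware.

```python
def classify_packets(received_seq, sent_count):
    """Classify a receive log against the expected sequence 0..sent_count-1.

    Hypothetical helper: assumes each transmitted packet carries a unique,
    incrementing sequence number.
    """
    seen = set()
    duplicates = 0
    out_of_order = 0
    highest = -1
    for seq in received_seq:
        if seq in seen:
            # Same sequence number delivered more than once.
            duplicates += 1
            continue
        seen.add(seq)
        if seq < highest:
            # A new packet arrived after a later one was already delivered.
            out_of_order += 1
        else:
            highest = seq
    # Anything never seen at all counts as lost.
    lost = sent_count - len(seen)
    return {"lost": lost, "duplicates": duplicates, "out_of_order": out_of_order}


# Example log: packet 4 never arrives, packet 2 arrives late and twice.
stats = classify_packets([0, 1, 3, 2, 2, 5], sent_count=6)
print(stats)
```

Counting each category separately is what makes it possible to tell whether one fault or several are at work.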

…eliminated all of the issues that were attributable to substantial packet loss…

Ultimately, I identified five root causes of the communication errors, both firmware and hardware related: improper configuration of the transceiver’s Clear Channel Assessment register when using the vendor’s high-power module, an out-of-order receive buffering issue in the stack, a transmit completion interrupt that occasionally never occurred, causing the stack to perform a full transceiver initialization for recovery, an insufficient default timeout for MAC-layer ACKs, and corrupted reads from the transceiver’s RXFIFO.

I wrote a detailed explanation of my test procedures and discoveries and sent it to the silicon vendor. While waiting for a response, I finalized a number of fixes and workarounds for the client’s products and was able to achieve performance similar to that of the Texas Instruments transceiver and stack. This eliminated all of the issues that were attributable to substantial packet loss, and the client was very pleased. The silicon vendor eventually responded and agreed with the analysis except for the MAC ACK timeout. Interestingly, they also admitted that early versions of the transceiver’s errata had described the corrupted RXFIFO reads, but since they believed that there were no feasible workarounds, they removed the problem description from their latest documentation!

Crypto Library Performance Optimizations

I was asked to decrease the setup time of HTTPS and SSH connections in APC’s network management product. The core bottleneck of the public key algorithms used during connection establishment was the already heavily optimized big number calculations. I was able to make some slight instruction substitutions in the big number assembler code so that it took fewer clock cycles to perform the same effective work. These routines were being called hundreds of thousands of times, so even a few clock cycles added up to measurable changes in performance.

… increased the speed of establishing HTTPS and SSH connections by 30%…

Additionally, I made changes to the code that allowed the assembler routine to run from RAM instead of flash. The RAM has two fewer wait states than flash, which resulted in a savings of 40ns per instruction fetch. I was able to increase the speed of establishing HTTPS (7->5 seconds) and SSH (17->12 seconds) connections by approximately 30%, which made several important customers very happy.
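The arithmetic behind these figures can be checked with a short Python sketch. The connection times and the 40ns saving come from the text above; the clock period and frequency are only an inference that holds if each wait state costs exactly one CPU clock cycle, which is an assumption and not a stated fact about the hardware.

```python
def percent_reduction(before_s, after_s):
    """Percentage by which the setup time dropped."""
    return 100.0 * (before_s - after_s) / before_s


https_gain = percent_reduction(7, 5)    # HTTPS: 7 s down to 5 s
ssh_gain = percent_reduction(17, 12)    # SSH: 17 s down to 12 s

# ASSUMPTION: one wait state == one CPU clock cycle.
saving_ns = 40.0
wait_states_removed = 2
cycle_ns = saving_ns / wait_states_removed      # implied clock period
implied_clock_mhz = 1000.0 / cycle_ns           # implied clock frequency

# Number of instruction fetches needed to save one full second.
fetches_per_second_saved = 1e9 / saving_ns

print(f"HTTPS: {https_gain:.1f}%  SSH: {ssh_gain:.1f}%")
print(f"Implied clock: {implied_clock_mhz:.0f} MHz")
print(f"Fetches per second saved: {fetches_per_second_saved:,.0f}")
```

Both reductions come out just under 30%, which matches the approximate figure quoted.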

Analytics Optimizations

Alektrona’s HVAC/R monitoring web services platform was moving from an early pilot stage, where it was processing and alarming on data from about two dozen heating and cooling systems, to a full pilot stage that required handling several hundred to a thousand systems simultaneously. However, the pilot server was already having trouble managing the existing load, so I needed to figure out what was running so slowly.

The web services were split into three parts: the analytics service, the web portal service, and the communications service. The process statistics made it clear that the bottleneck was in the analytics service, which was accessing data from a SQL database and then processing that data to determine the start or end of alarm conditions. Initially it was unclear which part of the analytics code was slow, so I added some instrumentation, which allowed me to pinpoint that the vast majority of execution time in the service was spent retrieving data from the database.
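Instrumentation of this kind can be as simple as accumulating wall-clock time per named section. The following Python sketch shows one way to do it; the section names and the stand-in functions fetch_rows and evaluate_alarms are hypothetical and do not reflect the actual service code.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Total seconds spent in each named section.
section_totals = defaultdict(float)


@contextmanager
def timed(section):
    """Add the elapsed wall-clock time of the enclosed block to a section."""
    start = time.perf_counter()
    try:
        yield
    finally:
        section_totals[section] += time.perf_counter() - start


def fetch_rows():
    # Hypothetical stand-in for a slow database query.
    time.sleep(0.05)
    return [1, 2, 3]


def evaluate_alarms(rows):
    # Hypothetical stand-in for the alarm logic.
    return sum(rows) > 5


with timed("db_fetch"):
    rows = fetch_rows()
with timed("alarm_logic"):
    alarm = evaluate_alarms(rows)

# The section with the largest accumulated time is the first place to look.
slowest = max(section_totals, key=section_totals.get)
print(slowest, dict(section_totals))
```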

 …increased the performance of the analytics service by 800%…

I investigated how I might optimize the select calls by running some simple, isolated experiments with the database. I observed that the time spent accessing the database was dominated by the where clause that bounded the time frame of the data, and was not greatly affected by the actual amount of data being retrieved. Each of the eight alarm conditions was reading a variable time range of data based on user configuration, but never more than 24 hours’ worth. So I decided to always read 24 hours’ worth of data, cache the results in memory, and have the code for each alarm filter out the data points not relevant to its processing.
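The read-once, filter-per-alarm idea can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the WindowCache class, the fake_fetch_since function standing in for the SQL select, and the alarm names are all hypothetical, not the production code.

```python
from datetime import datetime, timedelta

MAX_WINDOW = timedelta(hours=24)  # no alarm ever needs more than this


class WindowCache:
    """Fetch the full 24-hour window once, then filter in memory per alarm."""

    def __init__(self, fetch_since):
        # fetch_since(start) -> list of (timestamp, value); stands in for
        # the single time-bounded database query.
        self._fetch_since = fetch_since
        self._rows = None
        self.query_count = 0

    def load(self, now):
        self._rows = self._fetch_since(now - MAX_WINDOW)
        self.query_count += 1

    def rows_for(self, now, window):
        if self._rows is None:
            raise RuntimeError("call load() before rows_for()")
        if window > MAX_WINDOW:
            raise ValueError("window exceeds the cached range")
        cutoff = now - window
        return [(ts, v) for ts, v in self._rows if ts >= cutoff]


# Fake data store: one sample per hour for the last 48 hours.
now = datetime(2020, 1, 2, 12, 0)
store = [(now - timedelta(hours=h), float(h)) for h in range(48)]


def fake_fetch_since(start):
    return [row for row in store if row[0] >= start]


cache = WindowCache(fake_fetch_since)
cache.load(now)

# Hypothetical per-alarm windows taken from user configuration.
alarm_windows = {
    "alarm_a": timedelta(hours=1),
    "alarm_b": timedelta(hours=6),
    "alarm_c": timedelta(hours=24),
}
counts = {name: len(cache.rows_for(now, w)) for name, w in alarm_windows.items()}
print(counts, "queries:", cache.query_count)
```

The trade-off is reading somewhat more data than any single alarm needs in exchange for paying the expensive time-bounded query only once per pass.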

Using this technique I was able to increase the performance of the analytics service by 800% with only very minor and easily verifiable code changes. Along with some reductions in the resolution of data being captured by the system, this quickly positioned the company for the next pilot stage.

NOR Flash Bug

During qualification of a NOR Flash part, I observed that the network management card would hang during a firmware upgrade approximately 1 in 500 times. Since it could take many hours to reproduce the problem, I modified the OS firmware loader to continuously reload itself, which let me reproduce the problem much more quickly. To narrow down the problem further, I added instrumentation to the flash driver that drove an I/O pin high on entry to the flash write function and low on exit.

…averted a field recall by identifying the problem and denying part qualification…

With a scope, I was able to determine that the hang occurred during a flash write operation. After narrowing down the problem further, I was convinced that the flash part was prematurely indicating that a word had finished programming. The part had multiple methods of checking for write completion, one of which was a dedicated output pin. I connected this output to the scope and was able to confirm that the completion check used by the flash driver was being signaled prematurely by the flash part. Within 24 hours of my contacting the flash vendor’s technical support, they were able to reproduce and confirm this previously undetected hardware bug. I was able to avert a potential field recall of devices by finding this problem and denying the qualification of the part.