Akademisk Radioklubb

LA1K / LA1ARK / LA1UKA

Exploring why WSJT-X didn’t want to decode – an illustrated guide to printf-debugging

We’ve recently been active on EME again – but what we didn’t mention was that we weren’t using ARK’s computers for decoding, but rather LA3WUA Øyvind‘s personal computer due to unclear and unresolved issues. Computer detectives has since then been on the case, and we’ve found evidence for USB crimes.

1 – The problem

We have a dedicated Linux machine at the satellite/EME station, with WSJT-X (Weak Signal Communication Software) installed, which is an application for running e.g. JT65C QSOs.

Apparently, there was a bug, either in the application, on the computer, or somewhere else, causing WSJT-X to never decode any samples. We’d push the “Decode”-button, and it would stay blue forever, never attempting a decode and turning back to normal grey.

There was no feedback, we had no idea of what it could be. The only remaining approach left to us was to see what the source code of WSJT-X could tell us.

2 – Discovering the sample rate error

And what we found was a discrepancy in the data rate of the sound card of IC9100.

2.1 – WSJT-X operation and the mysterious counter

Armed with knowledge of Qt (the GUI toolkit used for displaying buttons and hooking up signals), we looked for what the “Decode”-button would trigger. We ended up in a function called decode(). This function pushes sampled data to a shared memory block, which is read from a separate Fortran program called JT9, which prints out decoded data to standard output, which is read in from WSJT-X again… and finally shown in the GUI and the button made non-blue. The author therefore thought that the fault was with JT9 and we just had to tune some decoding parameters. Turned out that our WSJT-X never even reached this point, however.

Before the decode()-function can continue on to actual decoding, it also checks a variable called m_dataAvailable. This variable is normally set in a different function called dataSink(). Further scrutiny revealed that this function is called whenever there is a new bulk of data available from the sound card, and that this variable would be set only when the following two variables were equal:

printf output for the counter and what the counter strives to become.

The noble art of printf debugging could reveal that one of them was a counter which was incremented every time dataSink() was called, and that should be equal to 179 in order for decoding to start. It never reached this point. It was reset every time the following progress bar was reset:

We fiddled with volume and fruitlessly looked into the sound card information, but no luck.

2.2 – Discovery of first error indication by accidentally sampling the wrong sound card

Here, my programming partner Øyvind got disgusted with the endless frustrations, turned off IC9100 and went home. I was almost about to leave for home as well, but thought, nah, let’s try some more. While Øyvind went down the stairs in order to drive home and go to sleep in a bed made wet of angry tears or something, the counter suddenly reached 179, and the “Decode”-button went gray. Was it Øyvind’s fault all along?

More investigation revealed that I was currently sampling from the computer’s internal sound card, and not IC9100, because I turned on IC9100 after restarting WSJT-X. Turning back to IC9100 made the counter go back to its previous behavior. It was also visually very easy to see that the numbers were printed with different speeds when sampling from the internal sound card and when sampling from IC9100. There was therefore something off with the sampling rate.

The decoding functionality of JT9 assumes that the data is sampled at a fixed rate of 12 kHz. WSJT-X achieves this by sampling at 48 kHz and decimating by a factor of 4. Additionally, only frames within a period of 60 seconds are attempted decoded. The counter above probably counts up the number of collected frames, compares it against the required number, and resets the counter every 60 seconds, all under the assumption that the datarate is constantly the datarate asked for.

2.3 – Comparing effective sampling rates

Since the dataSink() function receives information about the amount of received data, we could easily calculate effective sample rates and plot them as a function of time.

Found the problem! Sampling the internal sound card gives sample rates around the expected 48 kHz (or 12 kHz after downsampling), while IC9100 is not sampled with the correct sample rate (around 8 kHz after downsampling, 32 kHz before downsampling).

IC9100 thinks it delivers the correct sample rate, however.

Interestingly, it samples with the correct sample rate just after having been powered on:

It drops down after some time. Simultaneously, dmesg starts printing out error messages:

[359142.333124] retire_capture_urb: 35 callbacks suppressed
[359147.337113] retire_capture_urb: 1293 callbacks suppressed

In conclusion: Any sound card is sampled at sample rate 48 kHz, while the soundcard in IC9100 is for some reason sampled at 32 kHz, though it still thinks it has the sample rate 48 kHz. WSJT-X therefore doesn’t get enough samples for decoding through the designated timeframe.

We’d got a fair indication of the error and a quantifiable way of verifying whether it is present or not, and the question was then whether it was a hardware or software error.

3 – Testing WSJT-X on different hardware

The same version of WSJT-X was modified and compiled on my own laptop, and IC9100 was sampled. I got the correct sample rate:

Apparently, it worked fine on different hardware.

The question was then whether it was the hardware on the shack computer, the differing Linux kernel versions between the shack computer and my own computer, or something else. Before we could move on to that, we had to verify once and for all whether the behavior was due to something present only in WSJT-X, or whether we could reproduce it by a conventional sound recording application like arecord.

4 – Unsuccessfully reproducing the problem using sound recording software

Using arecord to record a data for a specific duration at a given sample rate produced no discrepancies – sampling at 48 kHz for 10 seconds produced 10 seconds of sound exactly, and there was no weirdness in the recorded sound file like increased pitch. It sounded exactly like the output from IC9100’s speakers, though I’d get retire_capture_urb messages and I should have had the same problem. I suspected that arecord probably made sure to record the necessary number of samples for the given duration, and that I’d find no problems here. We needed to measure the effective data rate at the given moment, throughout the measurement, not investigate the collected file.

At this point, it is allowed for the reader to suspect that the problem is not with the actual sampling rate, but with the data transfer rate. We’ll still use the words “sampling rate” until we reach the conclusion, where also the author can realize that it was the data rate all along. (In reality, he just can’t be bothered to change the figure labels.)

arecord can print samples to standard output. We therefore developed a small program for receiving samples into its standard input, and continuously calculate effective sample rates of the incoming data. The output of arecord was piped to this application, and effective sample rates where printed to standard output:

arecord -D [device] -r 48000 -f FLOAT_LE -t raw | ./arecord_collector

It has to be admitted that this solution will be influenced by CPU scheduling and interrupts, and will probably show a lower, irregular sampling rate, but it should be pretty close what WSJT-X would show.

Testing the setup on the normal sound card gave us effective sampling rates pretty close to 48 kHz (or what would be 12 kHz in the downsampled case). Testing it against IC9100 (after powering it off and on) gave rather dejecting results, however, since we got effective sampling rates close to what we were supposed to get, not what we ended up getting in WSJT-X:

46608.955717
47722.066684
47784.127241
47555.333851
47816.346975
47832.133129
47693.846387

Recording sound alone didn’t reproduce the reduced sampling rate problem, no matter for how long we recorded audio.

5 – Successfully reproducing the problem using audio recording software – but after having started WSJT-X once

Thinking that WSJT-X might be triggering something, we started WSJT-X, verified we still had the same problem, quit WSJT-X and fired up our probing tool – and bingo:

We were able to reproduce the problem also using arecord. We’d also continuously get the retire_capture_urb messages printed to dmesg, which we didn’t get earlier before starting WSJT-X. Woop!

The sample rate reduction was applicable also to other requested sample rates. Asking for 32 kHz would produce something below 25 kHz after having started WSJT-X.

WSJT-X would therefore trigger something which caused the effective sample rate to drop. Was WSJT-X handling audio in a different way?

To check this, we went on to take out the audio code in WSJT-X into a minimal application which took in sound from QAudioInput and pushed it into a class based on QIODevice, and calculated a similar effective sample rate.

The results here were no different from the results obtained from arecord. Turning off and on IC9100 produced sample rates close to 48 kHz, and the sample rate did not drop until we had started and shut down WSJT-X.

We scrutinized the audio code some more, and we couldn’t see anything there we hadn’t already reproduced in our minimal QAudioInput example, so the audio handling was declared innocent.

6 – Circling in – reproducing the problem using sound recording software and rigctl

The single thing WSJT-X does with the rig which we don’t in the pure audio recording tools is CAT communication, like controlling or reading off the current frequency of the rig. This is done using rigctl.

We therefore shut down everything, started our sample rate tracker, fired up rigctl – and set a new frequency to the rig.

Suddenly, the effective sample rate started falling, while the retire_capture_urb message began being printed to dmesg. Jackpot! We’d reproduced the problem without involving WSJT-X at all! We haven’t reached 32 kHz in the example above due to the mean sample rate being a mean of all samples over the entire duration, but later calls had rates close to 32 kHz.

At this point, I realized I’d never actually enabled CAT on my own computer. Maybe I’d found the culprit? Maybe it was a rigctl bug. Alas, after having set up CAT on my own computer, it still sampled IC9100 at an annoyingly regular sample rate, and WSJT-X was working fine :-).

7 – Testing a different set of USB contacts

It had to be a hardware problem. Maybe the internal USB hub thing was broken. Luckily, we had an independent row of USB 3.0 contacts we could try.

Moving a cable from USB 2.0 to USB 3.0 ports.

Then, everything was suddenly fine even on the shack computer. Regular sampling at 48 kHz! Decoding button turning blue. No problems.

8 – Realizations – it works only on USB 3.0 ports, regardless of computer in use

I realized I’d only been testing IC9100 on the (blue) USB 3.0 ports of my computer.

Two blue USB 3.0 ports on the side of my computer. (The double Windows stickers cancel out and become Linux.)

I also have a yellow port, however.

Yellow USB port of unknown designation (later revealed to be USB 2.0).

Plugging IC9100 into this port resulted in the same problems as the shack computer. We were onto something. Turns out both this yellow port and the other row of ports are USB 2.0 ports. For some reason, IC9100 wanted to work only on USB 3.0.

9 – Maybe it was the USB hub’s fault all along

Now, there is something that has not been mentioned yet. Some time ago, we started using an USB 2.0 hub between the computer and IC9100 in order to ensure that the internal USB hub of IC9100 had enough power on computers which did not supply enough power to their USB contacts.

Neither my own or the current shack computer has problems with this, however, so we also tried to connect IC9100 directly to both USB 2.0 and USB 3.0 ports on both my own computer and the shack computer, without any intermediate USB hub. No sampling rate trouble, everything fine.

10 – Conclusions

Summary: The sound card in IC9100 starts to deliver samples at data rate 32 kHz when it was supposed to yield data rate 48 kHz, under some specific conditions:

  • Connecting IC9100 to hub to USB 2.0: Receives audio at data rate 48 kHz.
  • Connecting IC9100 to hub to USB 2.0, sending commands through CAT interface/WSJT-X operation: Receives audio at effective data rate 32 kHz. Error triggered.
  • Connecting IC9100 to hub to USB 3.0, sending commands through CAT interface/WSJT-X operation: Receives audio at data rate 48 kHz, all fine.
  • Connecting IC9100 directly to USB 2.0, sending commands through CAT interface/WSJT-X operation: Receives audio at data rate 48 kHz, all fine.
  • Connecting IC9100 directly to USB 3.0, sending commands through CAT interface/WSJT-X operation: Receives audio at data rate 48 kHz, all fine.

We don’t specifically know why (mostly because I didn’t want to go any deeper at this point, not wanting to debug the USB protocol), but we speculate that it has to do with the data rate of the USB port at the computer being divided among the ports of the hub and then the ports of IC9100’s internal USB hub.

Judging from the behavior of arecord when asking for specific durations, the data was not actually sampled at 32 kHz, just lagging behind, and still producing normal sound. This indicates that the data transfer rates were too slow, and that the sound card itself was sampling at the correct sample rate. Maybe both CAT and sound operations can’t be supported at the required rates when the USB 2.0 rate is divided among 4 ports. Or maybe our USB hub is just weird.

In the end, we’ve fixed it by plugging the hub out from the USB 2.0 port and plugging it into the USB 3.0 port. Job done!

3 Comments

  1. Hi folks,
    see https://sourceforge.net/p/wsjt/mailman/message/36623873/ for my reply on this.

    BTW, great detective work!

    73
    Bill
    G4WJS.

  2. Holmes, Lewis, Morse and Vera all in one… Well done.

  3. I’ve experienced similar issues, although not with amateur radio gear but with equipment at work. A lot unfortunately can be attributed to the USB protocol, specification, and implementation. Especially when dealing with USB 3.0 ports and backwards compatibility (in my experience) performance really suffers.

    The problem in many of these cases is the implementation / (interpretation? ) of the USB specification by the manufacturer. Some manufacturers will skimp on parts and, e.g, not adhere to the USB 3.0 spec as per spec. It’s a headache. 😛
    Glad you guys were able to figure it out!
    73 LB1TI

Leave a Reply

Your email address will not be published.

*