Solving a problem of missing analytics data

June 29, 2024

Not long ago, one of our data analysts noticed we were missing a surprising amount of data about customer navigation, meaning it was harder for us to make sense of how customers journeyed around our site. Obviously, we’d like to know how people explore the site so we can optimize it and keep them coming back. To that end, we track two events in particular: “nav intent” and “nav completed”.

Nav intent is basically just a click. You click on a link to a sale page, that’s nav intent. Nav completed occurs once you land on the sale page. We stitch these two events together using an ID, navigationID.

What this analyst noticed is that we had a lot of nav completed event IDs that didn’t have a corresponding nav intent event. We also had a lot of nav intent IDs without the nav completed. The latter circumstance isn’t too surprising: we’d expect some loss in nav completions for well-known reasons, such as a slow network or a user closing their browser before the event can fire.
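To make the two kinds of gaps concrete, here is a minimal sketch of how events could be bucketed by navigationID. The event shape, the type strings, and the function name are assumptions for illustration, not our actual analytics schema.

```typescript
// Hypothetical event shape; only the navigationID field comes from the post.
interface NavEvent {
  type: "nav_intent" | "nav_completed";
  navigationID: string;
}

interface StitchReport {
  matched: string[];       // IDs seen on both events
  intentOnly: string[];    // clicked, but no completion recorded
  completedOnly: string[]; // landed, but no intent recorded
}

function stitchNavEvents(events: NavEvent[]): StitchReport {
  const intents = new Set<string>();
  const completions = new Set<string>();
  for (const e of events) {
    if (e.type === "nav_intent") {
      intents.add(e.navigationID);
    } else {
      completions.add(e.navigationID);
    }
  }
  const intentIDs = Array.from(intents);
  const completionIDs = Array.from(completions);
  return {
    matched: intentIDs.filter((id) => completions.has(id)),
    intentOnly: intentIDs.filter((id) => !completions.has(id)),
    completedOnly: completionIDs.filter((id) => !intents.has(id)),
  };
}
```

The completedOnly bucket is the puzzling one here: a landing with no recorded click means the intent event itself was lost.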

But we were still seeing more than we’d expect, and the lost nav intent events were puzzling. In some cases, we were missing 40% of them. Isn’t sendBeacon supposed to handle this sort of thing?

More puzzling, still, was that the dropped events got noticeably worse for certain components (buttons, menus, marketing callouts) at specific times. The analyst presented dozens of charts showing where and when these changes occurred. It looked like a classic situation in which someone had deployed code that caused a problem starting at 3pm on a given Tuesday, then a different problem on a different Thursday; maybe something was sort of fixed on a Monday. Find the code that went out, investigate how it might have broken tracking, and fix it.

But this challenge wasn’t quite so tractable. I went through dozens of PRs, tried to line up code deployments and feature toggle changes with our dropped events, and tried to find a consistent way to reproduce the failures. But, man, it was frustrating.

Take a step back

Sometimes it’s best to take a breath and look at the problem from first principles: how do we want the nav completion event to occur and how do we actually implement it? When a customer clicks a link, we set a cookie with a unique ID. When the customer lands on the target page, we find the cookie and grab the ID. This enables us to stitch together the intent with the completed events.
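The cookie round trip described above might look roughly like this. This is a sketch under assumptions: the cookie name, the attributes other than the lifetime, and both function names are made up for illustration.

```typescript
// Hypothetical cookie name; the post only says the cookie carries a unique ID.
const NAV_COOKIE = "navigationID";

// Build the cookie string written at click time. Max-Age is the TTL in seconds.
function buildNavCookie(id: string, maxAgeSeconds: number): string {
  return `${NAV_COOKIE}=${encodeURIComponent(id)}; Max-Age=${maxAgeSeconds}; Path=/; SameSite=Lax`;
}

// Read the ID back out of a Cookie request header on the target page's request.
function readNavCookie(cookieHeader: string): string | undefined {
  for (const pair of cookieHeader.split(";")) {
    const eq = pair.indexOf("=");
    if (eq === -1) continue;
    const name = pair.slice(0, eq).trim();
    if (name === NAV_COOKIE) {
      return decodeURIComponent(pair.slice(eq + 1).trim());
    }
  }
  return undefined; // cookie missing or already expired
}
```

On the client, the click handler would assign the result of buildNavCookie to document.cookie. If the browser has already expired the cookie by the time the next request goes out, readNavCookie returns undefined and the nav completed event has nothing to stitch to.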

The first thing I noticed is that the TTL for the cookie was only 1 second. That’s not a lot of time for the real world of old mobile devices, limited bandwidth, and uncertain networks. Further, we grabbed the cookie in our Express app at the point when the response had been sent to the client, not as soon as the request came in, so there was an even greater chance the one-second TTL would elapse.

This was an easy thing to test. I grabbed the cookie sooner in the flow so that we’d be more likely to have the ID we needed. This helped, but only by a little bit. I followed up by increasing our cookie’s TTL to 4 seconds and this basically eliminated our problem of dropped nav completions!
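Grabbing the cookie sooner could be done with a small middleware that runs when the request arrives and stashes the ID for later. The sketch below uses minimal stand-in types rather than importing Express so it stays self-contained; the middleware name and the use of res.locals as the stash are assumptions, not our exact code.

```typescript
// Minimal stand-ins for the parts of Express's request/response used here.
interface Req { headers: { cookie?: string } }
interface Res { locals: Record<string, unknown> }
type Next = () => void;

// Hypothetical middleware: capture navigationID as soon as the request comes in,
// instead of waiting until the response has been sent.
function captureNavigationID(req: Req, res: Res, next: Next): void {
  const match = /(?:^|;\s*)navigationID=([^;]*)/.exec(req.headers.cookie ?? "");
  const raw = match?.[1];
  res.locals.navigationID = raw === undefined ? undefined : decodeURIComponent(raw);
  next();
}
```

In Express this would be registered with app.use(captureNavigationID) ahead of the route handlers, so whatever fires the nav completed event later can read the ID from res.locals.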

Missing beacons

Handling the missing nav intent events seemed more perplexing, because it appeared we were doing everything right. I had a queue that we could fall back on if necessary. At the end of the page lifecycle, we listened to the visibilitychange and pagehide events (this is a great post about the ins-and-outs of all this stuff, by the way).

I considered using fetch with keepalive set to true. I considered unconventional methods like saving event data to localStorage and trying to send it later. I investigated whether our queue of beacons was getting too long and needed refinements.

We had logic that seemed clever — it had, in fact, solved problems for us with other, non-nav events. But perhaps it was a bit too clever. Whenever an event needed to fire, we created a payload, added it to a queue, and immediately triggered the queue. This would fire the sendBeacon and, if it failed, add the payload right back onto the queue for a second chance later.
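In outline, the queue behaved something like the sketch below. The class and method names are hypothetical, and the sender is injected so the example runs outside a browser; in our case it would wrap navigator.sendBeacon, which returns false when the browser refuses to queue the data.

```typescript
// A sender with the same contract as navigator.sendBeacon: true means "queued for delivery".
type BeaconSender = (url: string, body: string) => boolean;

// Hypothetical reconstruction of the enqueue-then-flush pattern described above.
class BeaconQueue {
  private pending: string[] = [];
  private readonly url: string;
  private readonly send: BeaconSender;

  constructor(url: string, send: BeaconSender) {
    this.url = url;
    this.send = send;
  }

  // Add a payload and immediately try to drain the queue.
  enqueue(payload: object): void {
    this.pending.push(JSON.stringify(payload));
    this.flush();
  }

  // Try every pending payload once; anything rejected goes back on for a later attempt.
  flush(): void {
    const batch = this.pending;
    this.pending = [];
    for (const body of batch) {
      if (!this.send(this.url, body)) {
        this.pending.push(body);
      }
    }
  }

  get size(): number {
    return this.pending.length;
  }
}
```

The catch with the second chance is that it depends on a later flush, typically from a visibilitychange or pagehide listener, and during a navigation that later moment may never come.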

I decided to try something simpler, and just send the nav intent events as soon as possible. Skip the queue. Skip the visibilitychange event listener and fallback listener. My hypothesis was that when a user is navigating to a new page we have very little time to send a beacon.
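The simpler version amounts to one call at click time. Here is a minimal sketch, with a made-up payload shape and function name, and the sender again passed in so it can run anywhere:

```typescript
// Same contract as navigator.sendBeacon: true means the browser accepted the data.
type BeaconSender = (url: string, body: string) => boolean;

// Hypothetical helper: build the nav intent payload and send it right away.
// No queue, no retry, no lifecycle listeners.
function sendNavIntent(
  send: BeaconSender,
  url: string,
  navigationID: string,
  component: string,
): boolean {
  const body = JSON.stringify({ type: "nav_intent", navigationID, component });
  return send(url, body);
}
```

In the browser, the send argument would be something like (url, body) => navigator.sendBeacon(url, body), and sendNavIntent would be called directly from the click handler, before the navigation starts tearing the page down.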

I pushed the change, waited a few days for meaningful data, and... the problem was fixed! My takeaway is that simple is often superior. In this case, simply firing a sendBeacon at the time we need to send the beacon is the most effective approach.