Syncing a Transcript with Audio in React

Keir Lewis

4 Apr 2022 • 7 min read

At Metaview, we help companies run amazing interviews. We achieve this in a number of ways including training interviewers with automated interview shadowing, and coaching them with personalized, contextual feedback.

Importantly, we provide a transcript of the interview so that interviewers effectively have perfect recall of what happened. This frees them from note taking during the interview and means they don’t have to rely on memory when trying to recall details.

When reviewing such transcripts in our web app, the audio from the interview can be played back. In this post we’ll explore how to visually sync the transcript with the audio so that the interviewer can follow along without effort.

Here’s an example:

[Embedded player: demo of the transcript synced with audio]

Doing this in React comes with some challenges. We need a way to play audio while programmatically tracking the changing playback position. We then need to animate a visual marker of the active word. Importantly, we want this page to perform well across a range of devices, including achieving high frame rates on low-powered devices.

Let’s dive in!

Implementation

First, let’s get our audio playback up and running. To keep things simple for now, we’ll use the <audio> element, which comes with a built-in UI.

To respond to changes in the timestamp as the audio plays, we can listen to the timeupdate event. It is fired “when the time indicated by the currentTime attribute has been updated”. In order to subscribe to this event, we attach a ref to the audio element so we can access its addEventListener function.

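import React, { useEffect, useRef } from "react";

// audioSrc is assumed to be defined elsewhere (the URL of the recording).
const Player: React.FC = () => {
  const playerRef = useRef<HTMLAudioElement>(null);

  useEffect(() => {
    // Capture the element once so the cleanup removes the listener from the
    // same node, and so TypeScript knows it is not null.
    const player = playerRef.current;
    if (!player) return;

    const onTimeUpdate = () => {
      console.log(player.currentTime);
    };
    player.addEventListener("timeupdate", onTimeUpdate);
    return () => player.removeEventListener("timeupdate", onTimeUpdate);
  }, []);

  // controls={true} displays the <audio> UI
  return <audio controls src={audioSrc} ref={playerRef} />;
};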

We’re using useEffect to subscribe to the player’s timeupdate event, and logging the currentTime to the console for now. The function returned from useEffect is the cleanup method and will remove the event listener when the component unmounts.

Performance

Here’s where things start to get trickier. You might be tempted to:

  • Store the changing currentTime value in React state
  • Pass this as a prop to a child component that highlights the currently-spoken word in the UI
  • Call it a day

This will probably work fine on your powerful development machine. If only all of our transcript users were so lucky. We want this interface running at 60fps on a potato, so we’ll need to think carefully.

The timeupdate event can fire very frequently (anywhere from about 4 Hz to 66 Hz, depending on system load), and plugging it straight into React state will trigger a React render on every event. In small component trees this might be fine, but we want to be able to build complex features on top of this transcript component and still have it render at 60fps on low-powered devices.

Thankfully there is another way!

Modifying styles directly

How you style the currently-spoken word depends on the shape of your transcript data. At Metaview, each word comes with a startTime and endTime so we know when the word was spoken relative to the start of the audio.

ℹ️
Here, currentTime is in ‘Seconds’ so we need to make sure that we’re comparing it with other time values in ‘Seconds’, or converting between units appropriately. Later on, I’ll show you a neat way to get TypeScript to help us with this.

Here’s the idea:

  • In the timeupdate event handler, find the currently-spoken-word. This can be done by comparing the player’s currentTime to the startTime and endTime of each word.
  • Using refs, find the DOM element corresponding to this word and add properties to its style object to make it stand out.
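
The lookup in the first step can be isolated as a small pure function. The sketch below is one way to do it, not necessarily what runs in production: it assumes the words are sorted by startTime and don’t overlap, which lets us swap a linear scan for a binary search. TimedWord and findActiveWordIndex are names made up for illustration.

```typescript
// Hypothetical word shape: all we rely on is a startTime and an endTime.
interface TimedWord {
  startTime: number; // seconds from the start of the audio
  endTime: number; // seconds from the start of the audio
}

// Returns the index of the word being spoken at currentTime, or -1 if
// currentTime falls before the first word, after the last one, or in a gap.
// Assumes words are sorted by startTime and do not overlap.
function findActiveWordIndex(
  words: readonly TimedWord[],
  currentTime: number
): number {
  let low = 0;
  let high = words.length - 1;
  while (low <= high) {
    const mid = (low + high) >> 1;
    const word = words[mid];
    if (currentTime < word.startTime) {
      high = mid - 1; // the active word, if any, is earlier
    } else if (currentTime >= word.endTime) {
      low = mid + 1; // the active word, if any, is later
    } else {
      return mid; // startTime <= currentTime < endTime
    }
  }
  return -1;
}

const sampleWords: TimedWord[] = [
  { startTime: 0.0, endTime: 0.4 },
  { startTime: 0.5, endTime: 0.9 },
  { startTime: 0.9, endTime: 1.6 },
];

console.log(findActiveWordIndex(sampleWords, 0.2)); // 0
console.log(findActiveWordIndex(sampleWords, 0.45)); // -1 (gap between words)
```

A plain findIndex works just as well for short transcripts. The binary search only starts to matter when the handler runs many times a second over thousands of words.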

Let’s modify the code from before to implement this:

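const Transcript: React.FC<Props> = ({ transcript }) => {
  const playerRef = useRef<HTMLAudioElement>(null);
  const wordsRef = useRef<HTMLSpanElement>(null);

  useEffect(() => {
    const player = playerRef.current;
    const container = wordsRef.current;
    if (!player || !container) return;

    // Remember the highlighted element so we can un-highlight it later.
    let activeElement: Element | null = null;
    const onTimeUpdate = () => {
      const time = player.currentTime;
      // The active word is the one whose time range contains the position.
      const activeWordIndex = transcript.words.findIndex(
        (word) => time >= word.startTime && time < word.endTime
      );
      // item() returns null when the index is -1 (no word is being spoken).
      const wordElement = container.children.item(activeWordIndex);
      if (wordElement === activeElement) return;
      activeElement?.classList.remove("active-word");
      wordElement?.classList.add("active-word");
      activeElement = wordElement;
    };
    player.addEventListener("timeupdate", onTimeUpdate);
    return () => player.removeEventListener("timeupdate", onTimeUpdate);
  }, [transcript]);

  return (
    <div>
      <span ref={wordsRef}>
        {/* word.text is an assumed field name for the word's display text */}
        {transcript.words.map((word, i) => <span key={i}>{word.text}</span>)}
      </span>
      <audio controls src={audioSrc} ref={playerRef} />
    </div>
  );
};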

This code is slightly simplified and omits some boilerplate. active-word is a CSS class containing styles, such as a light background color, to make the active word stand out. This is how the following word highlighting was implemented:

[Embedded player: demo of word highlighting]

The main thing to note is that we’re highlighting the active words outside of React’s render cycle by mutating DOM properties in a callback.

Performance comparison

Here’s a brief demonstration of the performance difference between storing currentTime in state and reactively responding to updates versus using callbacks and refs to skip React’s render.

To do this, I’m using the Chrome dev tools to throttle my CPU to a 6x slowdown (to simulate a low-powered device) and recording a trace.

Storing in state:

Here’s the flame chart from the performance tab. It’s a sea of red, and users on low-powered devices would not be happy. The interface feels sluggish and unresponsive, the active token crawls along significantly behind the audio, and the CPU is straining under the weighty heft of React’s render.

Each timeupdate event is handled in >400ms (and that’s being generous). Considering that achieving 60fps gives us a budget of roughly 16.7ms per frame, we are way off target here.

This is also with a small test React app with a tiny component tree. The latency would be even higher in larger real-world apps.
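
To put numbers on that, here is the frame budget arithmetic as a quick sketch. The helper names are ours, for illustration only, and the 400ms and 1ms figures are the handler times quoted in this post.

```typescript
// Time available to produce one frame at the given frame rate.
function frameBudgetMs(fps: number): number {
  return 1000 / fps;
}

// How many frame intervals a single handler of the given duration occupies.
function framesOccupied(handlerMs: number, fps: number): number {
  return Math.ceil((handlerMs * fps) / 1000);
}

console.log(frameBudgetMs(60).toFixed(1)); // "16.7"
console.log(framesOccupied(400, 60)); // 24
console.log(framesOccupied(1, 60)); // 1
```

In other words, a single 400ms handler blocks the main thread for the time it should take to paint about 24 frames, while a 1ms handler fits comfortably inside one.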

Using callbacks and refs:

Buttery smooth, not a dropped frame in sight. Users are happy, the CPU is happy, the active token marker is happily skipping along, perfectly aligned with the audio. And yes, this is still with the 6x CPU throttle. The timeupdate event is handled in <1ms and results in 60fps.

Compare that to the almost-half-second event handling attempts from before.

You get the picture.

Accessing currentTime outside timeupdate

What if we need to know the current time position outside of the timeupdate callback? A real example we faced was implementing keyboard shortcuts for skipping backwards and forwards.

We use MousetrapJS for binding keyboard shortcuts to callbacks. How could we implement a YouTube-like shortcut to skip backwards 10 seconds?

useEffect(() => {
  Mousetrap.bind('j', () => {
    // Skip backwards 10 seconds
  });
  return () => Mousetrap.unbind('j');
}, []);

Storing the currentTime in state seems like a bad idea. Not only will our FPS suffer (see above), we’ll be binding and unbinding a new callback every frame.

ℹ️
If the callback we pass to Mousetrap uses a state value, e.g. () => requestSeek(currentTime - 10), we would need to put currentTime in the useEffect dependency array, to make sure that the callback has an up-to-date value.

Enter Stage Left: Refs

The useRef hook creates a mutable ref object, and changing the value of its .current property doesn’t trigger a React render. It’s useful for more than just accessing DOM nodes.

We can store arbitrary mutable values in refs. This seems like a good option for storing currentTime. Let’s create a ref and update it when the timeupdate event fires. We can now bind our skip 10s shortcut like so:

useEffect(() => {
  Mousetrap.bind('j', () => {
    requestSeek(positionSecondsRef.current - 10)
  });
  return () => Mousetrap.unbind('j');
}, [requestSeek, positionSecondsRef]);

The values of positionSecondsRef and requestSeek don’t change (as long as we create requestSeek with useCallback), so the shortcut will be bound only once.
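
The snippet above doesn’t show how the ref is kept up to date, so here is a minimal sketch of that half. To keep it runnable outside React, a plain object stands in for the result of useRef. The skipTarget helper and its clamp at zero are our own additions for illustration, not part of the original code.

```typescript
// Same shape as the object returned by React's useRef.
type MutableRef<T> = { current: T };

// Stand-in for useRef(0): holds the latest playback position in seconds.
const positionSecondsRef: MutableRef<number> = { current: 0 };

// Called from the timeupdate listener. Writing to .current never renders.
function onTimeUpdate(player: { currentTime: number }): void {
  positionSecondsRef.current = player.currentTime;
}

// Hypothetical helper: works out where a skip should land, clamped so a
// skip near the start of the audio never produces a negative position.
function skipTarget(ref: MutableRef<number>, deltaSeconds: number): number {
  return Math.max(0, ref.current + deltaSeconds);
}

onTimeUpdate({ currentTime: 42.5 });
console.log(skipTarget(positionSecondsRef, -10)); // 32.5
```

In the component, the shortcut callback would pass a value like this to requestSeek. Because the callback reads the ref only when the key is pressed, it always sees the latest position without being re-bound.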

Aside: Type branding

We’re dealing with time values in a lot of different places, and have to be careful with inconsistent units. The audio currentTime property is in seconds, so we need to make sure we’re not passing it into functions that accept milliseconds, or comparing it with values in milliseconds.

Since we’re using TypeScript, we can get some help from the type system to provide some static hints. With ‘type branding’ we can define a type Seconds so we get errors if we use it in the wrong way, while still being able to pass it around as a number.

type Brand<K, T> = K & { __brand: T }
type Seconds = Brand<number, 'Seconds'>;

const x: Seconds = 5 as Seconds;
function seekToPosition(position: Seconds) {
  //
}
seekToPosition(x); // Compiles
seekToPosition(5); // Compile error: Argument of type 'number' is not
                   // assignable to parameter of type 'Seconds'.
seekToPosition(5 as Seconds); // Compiles

I personally prefer this to embedding units in variable names, as it keeps the names concise.
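
Branding also makes unit conversions explicit. As a sketch, suppose some of our timestamps arrived in milliseconds (an assumption for this example; Milliseconds, millisecondsToSeconds and isWithin are hypothetical names). The only way to get from one unit to the other is through a conversion function:

```typescript
type Brand<K, T> = K & { __brand: T };
type Seconds = Brand<number, "Seconds">;
type Milliseconds = Brand<number, "Milliseconds">; // hypothetical second unit

// The single place where the unit changes, so the division lives here only.
function millisecondsToSeconds(ms: Milliseconds): Seconds {
  return (ms / 1000) as Seconds;
}

// Accepts only Seconds, so a Milliseconds value cannot slip in unconverted.
function isWithin(position: Seconds, start: Seconds, end: Seconds): boolean {
  return position >= start && position < end;
}

const wordStart = millisecondsToSeconds(1500 as Milliseconds);
const wordEnd = millisecondsToSeconds(2000 as Milliseconds);

console.log(isWithin(1.7 as Seconds, wordStart, wordEnd)); // true
// isWithin(1.7 as Seconds, 1500 as Milliseconds, wordEnd); // does not compile
```

The brand exists only at compile time. At runtime these are ordinary numbers, so there is no cost to passing them around.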

Pretty playback

Here are some brief tips on how to implement a couple of different styles.

Moving highlight behind words

[Embedded player: demo of the moving highlight]

This style moves a rectangle smoothly from word to word by animating its position and size. As in the previous example, we find the DOM node corresponding to the active word. We then use the node’s offsetTop, offsetLeft, offsetWidth and offsetHeight properties to know where to place the highlight. Adding a transition property to the highlight’s styles makes the position and size animated.
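
As a rough sketch of the placement step: the function below copies the word’s offset box onto the highlight’s style. It assumes the highlight is absolutely positioned inside the same offset parent as the words. The interfaces are minimal stand-ins so the example runs without a DOM; a real HTMLElement and its style object satisfy them. placeHighlight is a name made up for illustration.

```typescript
// The four layout properties we read from the active word's element.
interface OffsetBox {
  offsetTop: number;
  offsetLeft: number;
  offsetWidth: number;
  offsetHeight: number;
}

// The style fields we write on the highlight element.
interface HighlightStyle {
  top: string;
  left: string;
  width: string;
  height: string;
}

// Moves and resizes the highlight so it sits exactly behind the word.
function placeHighlight(style: HighlightStyle, word: OffsetBox): void {
  style.top = `${word.offsetTop}px`;
  style.left = `${word.offsetLeft}px`;
  style.width = `${word.offsetWidth}px`;
  style.height = `${word.offsetHeight}px`;
}

const highlightStyle: HighlightStyle = { top: "", left: "", width: "", height: "" };
placeHighlight(highlightStyle, {
  offsetTop: 24,
  offsetLeft: 110,
  offsetWidth: 58,
  offsetHeight: 20,
});
console.log(highlightStyle.left, highlightStyle.width); // 110px 58px
```

In the browser this would be called as placeHighlight(highlightElement.style, wordElement) from the timeupdate handler, with a CSS transition on the highlight doing the actual animation.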

Highlighting the current word

[Embedded player: demo of current-word highlighting]

This is similar to the example before, but we don’t animate the position or size. Instead, the highlight element is on top of the words, and we also set its textContent property to the value of the active word DOM node’s textContent.

ℹ️
You’ll want to use textContent and not innerText as it is more performant. Reading innerText triggers a reflow, which is not the case for textContent.

🛑
You’ll also want to use textContent and not innerHTML. Writing to innerHTML is slower because the value is parsed as HTML. More importantly though, it leaves you vulnerable to XSS(!).

Imagine if a candidate said in the interview “left angled bracket script right angled bracket alert left parenthesis quote got your cookie quote right parenthesis...” (etc...).

Ok maybe not, but it’s good to be safe anyway.

Closing thoughts

That’s it! This should be enough to get started, but there’s much more to building a delightful transcript page, such as a custom audio player UI, video playback, code playback, and more.

💙
Like this post? Join our team. Metaview is on a mission to power people decisions with the truth. Check out our open roles.
