Karlin Lillington: Technology finally gets around to doing my typing

Ask any journalist what they find to be the most tedious aspect of the job, and I’m pretty sure most will say it’s transcribing interviews.

Once upon a time, many of us would have had shorthand skills, enabling us to work easily from notebooks and avoid recordings – a time when hardly anyone had recordings anyway, because tape machines were too unreliable and clumsy.

Shorthand became redundant with the advent of portable tape players and then a succession of technological advances, including MiniDisc and digital recorders, and – easily the biggest boon – mobile phones, all of which could record conversations.

Even so, for years I’ve tried to stick to my own version of nearly illegible, condensed note-taking, because of the time saved by not having to – ugh – transcribe a recording.

Transcribing is an incredibly slow and dull process, especially for dilatory, hunt-and-peck typists. I raise my hand here. I never learned to type properly after taking an introductory class in fifth grade, which was so unbelievably boring that I daydreamed about ways of (somehow, safely) fracturing my arm or spraining a wrist.

When, to my shock, I actually did accidentally break a finger (jumping across slippery, seaweedy Californian tidepools on a school trip, after someone found an octopus) and had to get a cast and finger splints on my lower arm, I confess that I considered the excruciating pain of the break a fair exchange for escaping typing class.

Transcribing an interview is nearly as hideous as that typing class, involving lots of stop-and-start and back-and-forth listening to the recording. You must endure the frustration of listening to the same bits umpteen times, and even worse, the torture of listening to your own oh-god-that-can’t-really-be-me voice asking questions that now seem pathetic and rambling. Transcribing a 30-minute interview would take me hours.

I would rather read the General Data Protection Regulation cover to cover. Or maybe, break a finger.

Thankfully, technology began to step in, though tentatively. I was an early adopter of voice-recognition software for dictating stories, due to repetitive strain injury no doubt caused in part by my unorthodox typing “style”.

My top-of-the-line software supposedly had the ability to transcribe recordings, but I found it mostly gave me barely usable mush. Alas, back to the misery of manual transcriptions.

A few years ago, I started work on a project that involved interviews of an hour or two, with someone who sent recordings from his end, made with a state-of-the-art microphone. Even the best voice-recognition software – the one I had just bought, with what, eventually, would be the same engine that’s now widely used in many mobile products – gave me extremely poor transcriptions back then.

I scoured the web for other possibilities, settling on a free Massachusetts Institute of Technology research project that gave a middling but better result, still needing much slow correction.

Since then, I’ve assiduously avoided doing transcriptions. I’d also, ignorantly, assumed nothing much had changed in the automated transcription world. But last month, when I had to transcribe a long interview, I discovered that automated transcription has been transformed.

Online services

Numerous online subscription or pay-as-you-go services now exist that not only do a highly usable transcription but can also separate out multiple speakers with good accuracy, produce keywords and add in time notations and captions. Five products that I looked at provided an astonishing range of sophisticated tools for audio and video processing.

Two of them allowed me to do a free test run – Otter.ai and Sonix.ai. Both were impressive, returning decent transcripts in under half an hour, noting keywords and enabling their text to be scanned while the audio plays, to check for accuracy.

For the most part, both correctly recognised the three different speakers on my so-so quality recording, made by an iPhone on the table during a Zoom conversation. Both offer plenty of additional features, too, such as generating subtitles and captions, and allowing collaboration. Sonix can do automated translations into 30 languages. Both will do automated transcripts of web-conferencing sessions.

After a free 30-minute trial transcription, Sonix charges $10 an hour for producing transcripts, or $5/hour plus a $22/month subscription. Otter offers 600 minutes of transcription per month for free with some limits on what you can do, but has subscription offerings as well.

Otter has a service where it will utilise your calendar links to automatically connect you to your scheduled Zoom sessions, doing a live transcription and sending it on to you after. You’ll have to pay for the $20/month business subscription for this.

But for me, this is the pinnacle of voice-transcription achievement. You will appear to be at a boring meeting when you’re actually watching Netflix, while Otter dutifully produces a meeting transcript that you can read in a fraction of the time needed to actually attend.

Now you’re talking.