
Museum audio guide script best practices: pacing, crowded exhibits and short attention spans

May 23, 2026

A practical guide to writing museum audio guide scripts: stop length, word budgets, pacing, crowded-gallery cues, translation, audio description and gallery testing.

A good museum audio guide script is short, structured and tested in the gallery. Most object stops should run 60 to 90 seconds, with a source-language master script of roughly 110 to 170 spoken words once pauses, looking time and translation expansion are allowed for. Each stop should make one argument in five beats: hook attention, describe what the visitor can see, explain why it matters, leave one memorable detail and cue the visitor out. Crowded-gallery scripts should avoid body-relative directions such as "look left". Audio description works best as its own accessible variant, planned alongside the standard tour.

This guide is for curators, interpretation managers, content producers and exhibition project teams writing or commissioning museum audio guide scripts. It covers stop length, word budgets, the structure of a single stop, scripting for crowded and immersive rooms, descriptive non-patronising language, audio description for blind and low-vision visitors, translation control, listen-through measurement, and the checks to run before paying for voiceover.

Write for the standing, distracted listener

A wall label can be re-read. A catalogue essay can be put down and picked up. An audio guide script has none of that. The visitor is on their feet, looking at an object, sometimes moving, sometimes being nudged by the next group coming through the door. Beverly Serrell's tracking-and-timing research across 108 exhibitions, summarised in Paying Attention: The Duration and Allocation of Visitors' Time in Museum Exhibitions, showed that visitors are selective: many skip more than half of the available exhibition elements, and total exhibition dwell time is often short. A script that assumes continuous, patient attention will lose listeners before the end.

The implications for writing are concrete. Sentences should be short, usually 12 to 18 words, with one idea each. The target is first-listen clarity for visitors with mixed first languages, mixed subject knowledge and background noise. The American Alliance of Museums accessible communications guidelines recommend plain, direct language, short active sentences and avoiding jargon. The rule transfers directly to spoken text. Active voice carries better than passive. Concrete nouns carry better than abstract ones. Curator-internal vocabulary should be defined the first time, or cut.
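
The sentence-length guidance can be checked mechanically before a read-through. The sketch below is a minimal illustration in Python, not a tool that any of the cited guidelines prescribe. The function name long_sentences, the 18-word default and the naive split on full stops, question marks and exclamation marks are all assumptions made for the example, and the draft text is invented.

```python
import re

def long_sentences(script: str, max_words: int = 18) -> list[tuple[int, str]]:
    """Return (word_count, sentence) for every sentence longer than max_words."""
    # Naive split: break after '.', '?' or '!' when followed by whitespace.
    # Abbreviations such as "c. 1650" will be split too; this is only a first pass.
    sentences = [s.strip() for s in re.split(r"(?<=[.?!])\s+", script) if s.strip()]
    flagged = []
    for sentence in sentences:
        count = len(sentence.split())
        if count > max_words:
            flagged.append((count, sentence))
    return flagged

# Invented draft text for illustration only.
draft = (
    "A small bronze figure kneels with arms raised. "
    "The work, which was acquired in the nineteenth century from a private collection "
    "whose earlier history is only partly documented, demonstrates a technique that "
    "scholars still debate."
)

for count, sentence in long_sentences(draft):
    print(f"{count} words: {sentence}")
```

A flagged sentence is a prompt to re-read, not an automatic cut. Some long sentences carry well when spoken, and some short ones still hide two ideas.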

The device also changes attention. In Look2Innovate deployments measured through Look2Guide CMS and Visitor Statistics, dedicated audio-only guides often show average listen-through rates above 95 percent for started tracks. The reason is practical: the device has a single purpose. A visitor holding a museum audio guide is less exposed to messages, camera notifications, maps, social feeds and battery alerts. The script still has to earn attention, but the medium gives it a cleaner chance.

The same discipline applies to the whole tour. A 20-stop tour made of 90-second stops already contains half an hour of audio before walking, waiting, looking and replaying are counted. Longer visits should be split into clearly named routes, highlights, family tours or accessible variants so visitors can stop, sit and resume. In Look2Innovate deployments, large sites such as Musée d'Orsay and The National Gallery run multiple tour variants for exactly this reason.
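
The half-hour figure is simple arithmetic, and it is worth running for every route before recording. A minimal sketch, assuming each stop's target duration is held in seconds; the function name tour_audio_minutes is hypothetical.

```python
def tour_audio_minutes(stop_seconds: list[int]) -> float:
    """Total recorded audio in minutes, before walking, waiting, looking or replays."""
    return sum(stop_seconds) / 60

# Twenty stops of 90 seconds each, as in the example above.
twenty_stops = [90] * 20
print(tour_audio_minutes(twenty_stops))  # 30.0
```

The result is a floor, not a visit length. Time spent walking, queuing at hero objects and replaying tracks comes on top of it.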

The five-beat structure for a single stop

Inside the 60-to-90-second window, stops that hold a visitor's attention tend to share the same internal shape. The beats are old, borrowed from broadcast journalism and gallery interpretation training, but they work because they map onto how a listener processes a new object: catch attention, ground in what they can see, give it meaning, leave one thing they will remember, and signal the end so they can move on without lingering awkwardly.

A repeatable beat structure for a 60-90 second audio guide stop.
Beat	Target length	What it does	Common failure
Hook	1 sentence, 5-10 seconds	A curious question, surprising fact or vivid image that earns attention	Restating the object label the visitor just read
Describe	1-2 sentences, 10-15 seconds	Anchor the listener in the visible object, materials, scale or gesture	Skipping description because 'they can see it'
Meaning	2-4 sentences, 25-40 seconds	Story, context, why this matters; one clear argument	Cramming biography, period, technique and provenance into one stop
Memorable detail	1-2 sentences, 10-15 seconds	One concrete image the visitor will carry to the next room	Closing with an abstract thesis statement
Cue out	1 short sentence	Signal the end and, if useful, point to the next stop or theme	No cue out, leaving the visitor unsure if the track has ended

The discipline matters more than the labels. A script that opens with the meaning, then describes, then hooks at the end may read well on paper. In the gallery, many listeners decide in the first ten seconds whether to stay. The hook is the part of the script that competes for attention with the room itself.

Pace the script to the room

Read aloud, comfortable museum narration usually sits around 120 to 150 words per minute. That makes the gross budget for a 60-to-90-second stop about 120 to 225 words. The usable script budget is lower. Pauses, looking time, pronunciation of proper nouns, music beds and translation expansion all take space. For a source-language master script, 110 to 170 words is a safer planning range for most object stops.
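
The gross budget is duration multiplied by reading pace. The sketch below reproduces that arithmetic and the reverse calculation, how long a given word count takes to speak without pauses. The function names are hypothetical, and the calculation deliberately ignores pauses, music beds and translation expansion, which is why the planning range sits below the gross figure.

```python
def gross_word_budget(duration_s: float, words_per_minute: float) -> float:
    """Words that fit in duration_s of continuous speech at the given pace."""
    return duration_s * words_per_minute / 60

def spoken_seconds(word_count: int, words_per_minute: float) -> float:
    """Seconds of continuous speech for word_count words, with no pauses."""
    return word_count * 60 / words_per_minute

print(gross_word_budget(60, 120))  # 120.0
print(gross_word_budget(90, 150))  # 225.0
print(spoken_seconds(170, 120))    # 85.0
```

At the slower pace, 170 words already takes 85 seconds of continuous speech, which leaves very little of a 90-second stop for pauses and looking time.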

Useful word budgets for common museum audio guide stop types.
Stop type	Typical duration	Source-language word budget	Use it for
Object stop	60-90 seconds	110-170 words	One object, one argument, one memorable detail
Hero object	90-120 seconds	170-240 words	A work visitors are already likely to stop for
Transition or wayfinding	20-45 seconds	35-90 words	Moving between rooms without forcing visitors to stare at a device
Immersive-room cue	15-40 seconds	25-75 words	Synchronised media, light changes or short scene-setting
Audio description variant	Usually longer than the standard stop	Set by the described object and route safety	Visual access for blind and low-vision visitors

Some languages expand on translation; names, grammar and sentence rhythm also change the recorded duration. If the source-language master script is already at the upper word limit, translated versions will overrun. Writing the source script to the lower bound, then briefing translators with target audio durations, prevents a tour where one language is materially slower than the others.
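
Once scratch reads or first recordings exist, overruns can be flagged per stop against the target duration. This is a minimal sketch under stated assumptions: the function name overrunning_languages, the 10 percent tolerance and the sample durations are all hypothetical, chosen only to show the check.

```python
def overrunning_languages(
    target_s: float,
    durations_s: dict[str, float],
    tolerance: float = 0.10,
) -> list[str]:
    """Return language codes whose recorded duration exceeds the target plus tolerance."""
    limit = target_s * (1 + tolerance)
    return sorted(lang for lang, seconds in durations_s.items() if seconds > limit)

# Invented durations for one stop with a 75-second target.
recorded = {"en": 74.0, "fr": 86.0, "de": 91.0}
print(overrunning_languages(75, recorded))  # ['de', 'fr']
```

The tolerance is an editorial decision for each museum. The point of the check is to catch a slow language version while the translation can still be tightened, before it is mastered.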

Pacing is also a function of the room. A quiet study gallery tolerates a slower, more reflective read. An immersive room with synchronised media, or a busy temporary exhibition with a high arrival rate, needs tighter pacing. Serrell's sweep rate index, the floor area divided by average dwell time, is a useful proxy: high-traffic rooms with low dwell times need shorter stops and clearer cue-outs so visitors are not held up. Low-traffic rooms with high dwell times can carry denser stops without congestion at the door.
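
As described above, the index is floor area divided by average dwell time, so a higher value means visitors cover the floor faster. A minimal sketch follows. The function name and the sample rooms are hypothetical, and the units are an assumption: any area and time units work as long as they are the same for every room being compared.

```python
def sweep_rate_index(floor_area: float, avg_dwell_minutes: float) -> float:
    """Floor area divided by average dwell time; higher means faster movement through the room."""
    if avg_dwell_minutes <= 0:
        raise ValueError("average dwell time must be positive")
    return floor_area / avg_dwell_minutes

# Two invented rooms with the same floor area and different dwell times.
busy_room = sweep_rate_index(3000, 10)   # 300.0
quiet_room = sweep_rate_index(3000, 20)  # 150.0
print(busy_room, quiet_room)
```

In this invented comparison the first room has the higher index, so it is the one that needs the shorter stops and clearer cue-outs.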

Voiceover direction belongs in the script and the studio brief. Mark pauses where the listener should look at a specific detail. Mark emphasis where the meaning would be lost on a flat read. If sound or music is layered under the narration, write the script with those beats in mind and check that the mixed master is intelligible in headphones inside the gallery. A track that tests fine in post-production can be unusable in a gallery with stone floors and 200 simultaneous visitors.

Scripting for crowded exhibits and immersive rooms

A crowded gallery breaks several writing habits that work in a quiet room. Spatial cues are the most common casualty. "Look to your left" assumes the visitor is in a fixed position with a clear sightline. In a crowded exhibition with people moving around the object, the left of the visitor is rarely the left of the object. Cues should be anchored to the object itself, for example "the figure on the right side of the painting, the one holding the lantern".

Order independence is the next habit to drop. Visitors arrive at stops in unpredictable orders, especially when triggering is keypad-based or when groups bunch at hero objects and skip ahead. Each stop should make sense on its own. Cross-references should be optional. "If you have already heard the introduction in room one, you will recognise this technique" is safer than "as we saw earlier". Numbering should map to the visible stop number.

Crowded-gallery script edits that reduce visitor confusion.
Fragile cue	Stronger cue	Why it works
Look to your left	On the left side of the painting	The cue stays true when the visitor changes position
As you saw in the previous room	This technique appears again here	The stop still works when visitors skip or change order
Stand directly in front of the case	From any side of the case, find the small silver clasp	The instruction survives crowding and partial sightlines
Now watch the projection change	When the projection shifts to blue, notice the sound underneath	The cue tolerates slight timing differences between visitors

Decide stop-and-listen or walk-and-listen for each stop

Stop-and-listen stops, where the visitor is expected to be still in front of an object, can carry richer description and a slower pace. Walk-and-listen stops, used between objects or in transitional spaces, need to be shorter, simpler, and tolerant of the visitor's attention drifting to wayfinding. The script packet should label each stop type so writers, voice directors and editors use the right density.

Write for the noise the visitor actually hears

School groups, language-switching among visitors, footsteps on stone and ventilation noise all eat into intelligibility. A script that uses too many soft consonants, complex clauses or unfamiliar names in quick succession will lose listeners even with good audio. Test reads in the actual space, with the actual headphones and with the room at peak occupancy, expose these problems before recording. Studio playback rarely does.

Descriptive, non-patronising language

Plain language means direct, precise language. The American Council of the Blind's Audio Description Project describes effective description as "concise, objective, accurate". The same three words are a useful editorial test for any museum script, including the audio description variant. Concise means each sentence does one job. Objective means the script gives the visitor the evidence to reach the obvious conclusion. Accurate means the script stays within what curators can defend.

Non-patronising language is partly tone and partly stance. Avoid narrating the visitor's reaction, as in "you may be surprised to learn...". Treat difficult subjects directly. Write for adults unless the tour is explicitly for children. The Smithsonian Guidelines for Accessible Exhibition Design make the same point in the context of written interpretation: respect the audience's intelligence while keeping the language usable.

Describe what can be seen before what it means

Visitors who can see the object still benefit from being told what to look at first. "A small bronze figure, kneeling, with arms raised. Its face is turned to one side, eyes closed" gives a listener something concrete to anchor the rest of the stop to. The same description, recorded as a separate audio description variant, makes the stop accessible to blind and low-vision visitors without rewriting the whole tour.

Small line edits that make audio guide narration easier to hear.
Weak line	Better line	Reason
The work demonstrates the artist's innovative engagement with materiality.	The artist left the clay rough, so the fingerprints still catch the light.	Concrete nouns and visible details are easier to follow on first hearing
You will be amazed by the scale of this object.	The vase is almost as tall as a ten-year-old child.	The script gives a usable measure and leaves the reaction to the visitor
The iconography indicates royal authority.	The crown and sceptre tell us this figure was meant to be read as a ruler.	A necessary technical idea is translated once, then explained

Avoid jargon, or define it once

Provenance, polychromy, iconography and chiaroscuro are useful nouns inside a department. They create friction outside it. Translate them into plain language, such as "the painted surface", or define them quickly the first time and then use them. Repeating the unfamiliar term across ten stops makes those stops harder to follow.

Plan audio description as its own script

Audio description for blind and low-vision visitors is a separate writing job. The WCAG principles for digital media, the Audio Description Project guidance and the institutional examples published by the Smithsonian National Museum of Natural History all converge on the same writing brief: describe what the sighted visitor sees, keep interpretation aligned with the standard tour, and let the listener form their own interpretation.

In practice this means a parallel script, usually slightly longer, that opens each stop with a visual description of the object and the space around it, then carries the same interpretive content as the standard tour. The same content workflow should manage both: one CMS, one set of approvals, one set of devices. The accessibility article in this guide series, Accessible audio guides for museums, covers the hardware and operational side; the script side belongs in the writing brief from the first draft.

Simplified-language variants for visitors with cognitive disabilities, autism or low literacy follow similar rules: shorter sentences again, more concrete vocabulary, fewer metaphors, and stops that can stand alone if a visitor stops the tour halfway. These variants also help visitors with limited time and non-native speakers.

Write the source-language script to be translated

Most museums of any size run several language variants. Device capacity is rarely the limit: Trend supports up to 32 languages on a single device, and tablet guides such as Look 3 can carry more. The main constraint is usually the time and budget to produce clean translations, plus the risk that one language runs twice as long as the source-language version. Writing the master script for translation controls that risk before it becomes a recording problem.

Keep the source-language script at or below the lower end of the word budget. Expansion in translation will fill the rest.
Avoid puns, rhymes, idioms and culturally specific references that do not translate cleanly. If they are essential, mark them as such for the translator.
Give translators the audio target duration alongside the word count. Words per minute differ by language.
Provide reference pronunciation for proper nouns and historical terms. Voice talent should not have to guess.
Use a glossary that fixes the translation of key museum terms across the whole tour. Consistency across stops matters more than elegant variation in any one of them.

AI-assisted drafting and translation are now realistic for short scripts and for first-pass language coverage. Look2Innovate's AI Content Studio and AI Audio Translate are designed for that workflow, but the quality is still below that of a professionally written and reviewed tour. The output should be treated as a draft that the museum's curators and a native-speaker reviewer correct before recording, especially for any stop that touches sensitive history, contested attributions or community representation.

Test the script in the gallery before paying for voiceover

Voiceover, editing and translation are the most expensive lines in a content budget, and the hardest to redo. The cheapest moment to find a script problem is before the studio session.

Read every stop aloud against a stopwatch and confirm it lands inside the target duration in the master language.
Walk the route reading the script standing in front of each object, at normal museum noise levels. Words that do not work in the room come out quickly.
Have a non-specialist colleague follow the route with the script and mark any sentence they had to re-read or did not understand.
Have a blind or low-vision tester walk the route with the audio description variant. Their feedback rarely matches what sighted writers expect.
Record a rough scratch read on a phone, load it onto a device fleet such as Trend, and walk the gallery with the actual headphones during a busy day.
Commit the script to professional recording, translation and mastering only after these passes.

Prepare the recording script packet before the studio

The final script packet should include more than prose. Give the voice director, translator and audio editor a table with the stop number, object title, object location, trigger method, target duration, spoken text, pronunciation notes, pause marks, media cues, rights notes and approval owner. That reduces studio improvisation and makes later CMS updates cleaner.
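
One way to keep that table consistent is to define the row once and export it. The sketch below is illustrative only: the class name StopRecord, the field names and the CSV export are assumptions, not a Look2Guide CMS format, and the sample row is invented.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields

@dataclass
class StopRecord:
    """One row of the script packet; fields follow the list in the paragraph above."""
    stop_number: str
    object_title: str
    object_location: str
    trigger_method: str
    target_duration_s: int
    spoken_text: str
    pronunciation_notes: str = ""
    pause_marks: str = ""
    media_cues: str = ""
    rights_notes: str = ""
    approval_owner: str = ""

def packet_to_csv(records: list[StopRecord]) -> str:
    """Serialise the packet as CSV text with one header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(StopRecord)])
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))
    return buffer.getvalue()

# Invented sample row.
packet = [
    StopRecord(
        stop_number="014",
        object_title="Kneeling figure",
        object_location="Room 3, north wall",
        trigger_method="keypad",
        target_duration_s=75,
        spoken_text="A small bronze figure, kneeling, with arms raised.",
        approval_owner="Curator of sculpture",
    )
]
print(packet_to_csv(packet))
```

Keeping the stop number as text rather than an integer preserves leading zeros, so the value can match the printed number exactly.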

Keep the production IDs stable. The number printed beside the object, the file name, the CMS stop ID and the script stop ID should match. Mismatched IDs can put a late edit in one language onto the wrong stop, especially in a fleet with many language variants managed through Look2Guide CMS.
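
The four-way match can be checked before files go to the studio or the CMS. A minimal sketch, assuming each stop is exported as a dictionary with the four IDs already extracted as plain strings. The key names and the function name mismatched_stops are hypothetical.

```python
ID_KEYS = ("printed_number", "file_name_id", "cms_stop_id", "script_stop_id")

def mismatched_stops(stops: list[dict[str, str]]) -> list[dict[str, str]]:
    """Return every stop whose four IDs are not all identical."""
    return [stop for stop in stops if len({stop[key] for key in ID_KEYS}) > 1]

# Invented records: the second stop has a stale CMS ID.
stops = [
    {"printed_number": "014", "file_name_id": "014", "cms_stop_id": "014", "script_stop_id": "014"},
    {"printed_number": "015", "file_name_id": "015", "cms_stop_id": "051", "script_stop_id": "015"},
]
for stop in mismatched_stops(stops):
    print("Check IDs for printed stop", stop["printed_number"])
```

Running the check once per language variant catches the case where only one translated file has drifted.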

Measure listen-through

A start count is useful, but it gives only a partial view of attention. The stronger metric is listen-through: how often visitors who start a stop keep listening to the end, or close enough to the end that the interpretive point has landed. On traditional dedicated audio guide fleets managed through Look2Guide, Look2Innovate often sees average listen-through above 95 percent for started audio tracks. That level of completion is hard to reproduce on visitor phones, where the same device is also the visitor's camera, messaging tool, map, ticket wallet and notification surface.
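
To make the definition concrete, listen-through can be computed from play logs as the share of started plays that reach the end, or near enough to it. The sketch below is an illustration only and does not describe how Visitor Statistics calculates its figures. The function name, the 90 percent completion threshold and the sample plays are assumptions.

```python
def listen_through_rate(
    plays: list[tuple[float, float]],
    completion_threshold: float = 0.9,
) -> float:
    """Share of started plays that reached completion_threshold of the track length.

    Each play is (seconds_listened, track_seconds).
    """
    if not plays:
        return 0.0
    completed = sum(
        1 for listened, length in plays if listened >= completion_threshold * length
    )
    return completed / len(plays)

# Four invented plays of an 80-second track; one visitor left after 20 seconds.
plays = [(80, 80), (76, 80), (20, 80), (80, 80)]
print(listen_through_rate(plays))  # 0.75
```

Where the threshold sits is an editorial choice. It should be late enough that the interpretive point has landed, which depends on where in the stop that point is made.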

This matters for script writing. A dedicated device gives the writer permission to be calm, precise and sequential. A phone-based tour usually needs shorter chunks, stronger visual prompts and more forgiving re-entry points because the visitor is more likely to be interrupted. In both cases, the museum's goal is to protect the visitor's exhibition mindset once they have chosen to listen.

Scripts are also living content. Visitor analytics from Visitor Statistics can show which stops have high drop-off, which language versions are used and where visitors abandon the tour. Stops that consistently drop visitors are worth re-reading. The script is the part of the tour that is cheapest to revise.

FAQ

How long should a museum audio guide stop be?

For most object stops, 60 to 90 seconds. In the source language, that usually means a script of about 110 to 170 words after allowing for pauses, looking time and translation expansion. Hero objects can justify longer stops, but a tour built mostly of two- and three-minute tracks loses attention before the end.

How many words per minute should the voiceover be?

Plan source-language narration around 120 to 150 spoken words per minute. Use the lower end for dense, emotional, technical or multilingual material. Brief translators with target durations and word counts, because a clean translation can still take longer to speak.

Should the script say 'look to your left'?

Usually avoid it. In a crowded gallery, the visitor's left is rarely aligned with the object's left. Anchor descriptions to the object itself, for example 'the figure on the right side of the painting, holding the lantern'.

How do you write a script for a crowded exhibit?

Keep stops short, make each one independent of the others, avoid spatial cues that rely on a fixed visitor position, and distinguish stop-and-listen stops from walk-and-listen ones. Test the script in the room at peak occupancy with the actual headphones before recording.

Should audio description be a separate tour or integrated?

In most cases, a separate parallel tour produces better results. Sighted visitors usually need concise object prompts; blind and low-vision visitors need fuller visual description. Use the same CMS and device fleet, and write the audio description script as its own variant.

What reading level should an audio guide script use?

Use plain-language adult prose: short active sentences, one idea per sentence, defined terms and concrete nouns. A sixth- to eighth-grade readability score can be a useful warning light, but the real test is whether a non-specialist can understand the stop on first hearing in the gallery.

How do we keep translations from doubling the run time?

Write the source-language master script at the lower end of the word budget, brief translators with target audio durations and word counts, avoid idioms and puns that expand on translation, and use a shared glossary across all languages so terms stay consistent.