Wallspace Captions: Real-Time Visual Systems for Deaf Audiences


⚠️ Important Update (v2.4.5)

This article reflects an earlier version of the WallSpace caption intelligence system and is now two releases behind.

The platform has since evolved significantly, including:

• Experience recording and publishing

• Multi-wall capture and replay

• 30-language translation expansion

• Time-synchronised comments and deep linking

• Speaker-aware caption rendering

👉 For the latest and complete technical documentation, read the full paper:

📄 WallSpace Captions — Final Technical Paper

This article remains a useful overview, but the PDF represents the current system architecture and capabilities.

🎬 Demo Video — See It In Action

This system is difficult to fully understand from text alone.

This short demo shows:

• Real-time caption intelligence

• Speaker + emotion tagging

• Caption-to-signal transformation

• Live visual system control

If you watch one thing, watch this.

▶ Watch the WallSpace Captions demo

  

Technical Papers written during cohort

Click any document below to open full PDF

(diagrams, architecture, and implementation details included)

📄 Wallspace Captions - Final Technical Paper

📄 MediaPipe vs Flowfal Comparison for Sign Language Tracking - Technical White Paper

📄 Sign Language Conductor - Gesture-Driven Visual Control System

📄 Transcript-Driven Visual Control System - Technical System Specification

📄 Ideal Audio Reactivity in TouchDesigner Requirements Paper

📄 Audio Reactive Engine Research Progress Report

📄 Accessibility Tooling - transcription, OCR, and workflow systems

Background & Context

Visual and written context behind my work

📄 Still Raving - 35+ Years of UK Rave Culture (1991 -> now)

📄 A Hacker's Journey: From London Hackspace to the 2012 Paralympics and Beyond

Figure 0. From text-only captions to real-time semantic signal. WallSpace transforms speech into structured, expressive, and machine-readable outputs that restore meaning and enable visual and system-level response.


This work was developed in close collaboration with WallSpace.Studio over an intensive 1.5-week period, spanning time zones and driven by rapid iteration, structured testing, and extended development sessions. The result is not simply improved captioning, but a fundamental shift towards treating speech as a real-time semantic signal capable of driving visuals, systems, and interaction. For the global deaf and hard-of-hearing community, this represents a meaningful step forward—offering richer access to communication, live environments, and creative experiences as the system continues to evolve and scale.

0. Introduction

Access to spoken word and music in live environments remains fundamentally limited for deaf audiences. Traditional solutions - such as human interpreters or basic captioning - are often constrained by availability, latency, cost, lack of synchronisation with music, and poor integration into visual experiences.

In particular:

·         Live music is largely inaccessible beyond low-frequency vibration

·         Lyrics are rarely available in real time, especially in festival or club settings

·         Existing captioning tools are not designed for performance environments (large screens, dynamic visuals, multi-source audio)

Within this system, Wallspace Captions operates as a core visual layer, treating captions not as an accessibility afterthought - as is often the case in the author’s lived experience as a profoundly deaf person - but as a primary, performance-driven visual element.

The system enables:

·         Real-time speech transcription

·         Automatic song recognition and synchronised lyric display

·         Visually integrated captions designed for large-scale projection and LED environments

Unlike conventional captioning systems, this approach is:

·         Engine-agnostic (multiple caption sources: Whisper, browser speech, music ID + lyrics)

·         Latency-aware and adjustable in real time

·         Designed for integration with VJ pipelines and visual systems

There are currently no widely adopted systems that combine real-time audio analysis, caption/lyric generation, synchronised visual rendering, and live performance integration within a single pipeline.

This project explores that full stack, with the goal of making live audiovisual experiences fully perceivable and engaging for deaf audiences, not just understandable. This work positions captions not as an auxiliary layer, but as a primary medium for visual expression and control.

The GitHub link is here.

This system has also been published as a Scope node for public use, testing, and further development.

 https://app.daydream.live/nodes/AEYESTUDIOS/wallspace-captions

Audio Input → Caption Engine → Text Stream → Event System → Visual Behaviour → Scope Rendering → Visual Output

Figure 1 – Wallspace Captions System Flow

1. Authorship & Contributions

Wallspace Captions was co-developed by this document’s author, Matthew Israelsohn, in collaboration with Jack Morgan of Wallspace.Studio, with both contributors working on system design, implementation, and iteration of the caption-driven visual pipeline.

The sign language conducting system, audio reactivity research, various technology comparison papers and accessibility tools were developed independently by the author.

Human Context & Collaboration

This project was developed during the AI Video Cohort in close collaboration with Jack Morgan (WallSpace.Studio).

As a profoundly deaf creator, my primary focus was not just transcription accuracy, but how speech, tone, and meaning can be translated into visual systems that are perceptible and expressive without sound.

During the cohort, we spent extensive time working together to push accessibility beyond “captions on screen” toward a deeper, system-level integration:

Speech -> semantic analysis -> emotion + voice features -> visual mapping -> AI prompt generation

This collaboration directly influenced the development of WallSpace’s Caption Intelligence Pipeline, where captions are no longer just text overlays, but structured signals that drive:

·         visual styling (colour, motion, emphasis)

·         timing (word-level reveal, pacing)

·         generative AI prompts (emotion, tone, meaning)

In parallel, my Wallspace Captions node focused on the presentation and visualisation layer within Scope, while WallSpace handled signal processing, routing, and integration across the wider system.

The result is a shared architecture where accessibility is not an add-on, but embedded throughout the entire pipeline - from microphone input to final visual output.

This approach reflects a broader goal:

To transform spoken language into rich, real-time visual experiences that can be understood, felt, and performed - not just read.

Thank you for collaborating with me, Jack.

2. Features

·         4 text input methods: Scope prompt field, manual text, OSC (UDP), WebSocket

·         Pre + Post pipelines: Pre-bakes text into frames so AI stylises it; Post overlays clean captions after AI generation

·         Advanced caption placement: XY coordinate positioning (percentage-based), preset positions (top/center/bottom), text alignment

·         Full styling control: Font size, colour (RGB), opacity, text outline with colour/width, background box with colour/opacity/padding/corner radius

·         Caption Event System: Parses text into structured events (WORD, SENTENCE_START/END, QUESTION, EXCLAMATION, PAUSE, EMPHASIS, SPEAKER_CHANGE) that drive visual behaviours

·         Event-reactive effects: Per-word flash, punctuation colour reactions, pause fade, emphasis highlighting

·         Prompt forwarding: Transcription text forwarded as prompts with style prefix, template formatting, and rate limiting
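The prompt-forwarding behaviour above (style prefix, template formatting, rate limiting) can be sketched in a few lines. This is an illustrative Python sketch, not the node's actual API; the class name, default prefix, and interval are assumptions.

```python
import time

class PromptForwarder:
    """Sketch of prompt forwarding: wrap transcription text in a style
    prefix and drop updates that arrive faster than a rate limit.
    Names and defaults here are illustrative, not the node's real API."""

    def __init__(self, style_prefix="cinematic, neon:", min_interval_s=1.0):
        self.style_prefix = style_prefix
        self.min_interval_s = min_interval_s
        self._last_sent = float("-inf")  # so the first prompt always passes

    def maybe_forward(self, text, now=None):
        """Return the formatted prompt if the rate limit allows, else None."""
        now = time.monotonic() if now is None else now
        if now - self._last_sent < self.min_interval_s:
            return None  # rate-limited: drop this update
        self._last_sent = now
        return "{} {}".format(self.style_prefix, text.strip())
```

Rate limiting matters here because generative backends typically cannot accept a new prompt per word; dropping intermediate updates keeps the visual output stable.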

System Context: Multi-Input Visual Pipeline

This project sits within a broader system exploring multiple input modalities for controlling visual systems in real time.

Three parallel input streams are being developed:

Audio Input → Caption Engine → Text Processing → Visual Rendering (Scope)

Figure 2 - Multi-Input Visual System Context

This diagram shows how multiple input modalities - audio (speech/music), gesture (sign language conducting), and audio feature analysis - are processed in parallel and unified into a shared visual pipeline.  All inputs are converted into structured control signals (text events, audio features, OSC data) which are combined within Scope and rendered through Resolume as a cohesive real-time visual output.

1.       Audio -> transcription / lyrics -> caption events

2.       Gesture -> motion tracking -> OSC control signals

3.       Audio features -> spectral / temporal analysis -> modulation signals

These are unified through a shared routing and rendering pipeline (OSC -> Scope -> Resolume), enabling multiple forms of expression to drive visuals simultaneously.
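As one concrete example of the OSC leg of this routing, an OSC 1.0 message is just null-padded strings over UDP, so a caption can be sent with the standard library alone. The `/caption/text` address and port below are illustrative assumptions, not addresses the system defines.

```python
import socket

def _osc_pad(b):
    # OSC strings are NUL-terminated and padded to a 4-byte boundary.
    b += b"\x00"
    return b + b"\x00" * (-len(b) % 4)

def osc_message(address, text):
    """Encode a single-string OSC 1.0 message: address, ',s' type tag, arg."""
    return (_osc_pad(address.encode()) +
            _osc_pad(b",s") +
            _osc_pad(text.encode()))

def send_caption(text, host="127.0.0.1", port=7000, address="/caption/text"):
    # Fire-and-forget UDP datagram; host, port, and address are examples.
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.sendto(osc_message(address, text), (host, port))
```

In practice a library such as python-osc would be used; the hand-rolled encoder just shows how little framing sits between a caption event and a visual system listening on UDP.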

Caption Event System (Core Innovation)

Wallspace Captions does not treat text as static output.

Incoming text is parsed into structured semantic events which can drive visual behaviour in real time.

Event types include:

·         WORD

·         SENTENCE_START / SENTENCE_END

·         QUESTION / EXCLAMATION

·         PAUSE

·         EMPHASIS

·         SPEAKER_CHANGE

This enables a second processing layer:

(Text -> Event -> Visual Modulation)


Figure 3 - Caption Event System: Secondary Processing Layer
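The parsing step from text to structured events can be sketched as below. This is a simplified illustration of the event types listed above, not the production parser; its heuristics (ALL-CAPS as emphasis, trailing ellipsis as a pause) are assumptions, and SPEAKER_CHANGE is omitted because it needs diarisation context.

```python
def caption_events(text):
    """Parse one caption line into (event, payload) tuples.
    Simplified sketch: the production parser's rules differ, and
    SPEAKER_CHANGE is omitted here."""
    events = [("SENTENCE_START", text)]
    for word in text.split():
        bare = word.strip(".,!?…")
        if bare.isupper() and len(bare) > 1:
            events.append(("EMPHASIS", bare))  # ALL-CAPS word as emphasis
        events.append(("WORD", bare))
        if word.endswith("?"):
            events.append(("QUESTION", word))
        elif word.endswith("!"):
            events.append(("EXCLAMATION", word))
        elif word.endswith(("...", "…")):
            events.append(("PAUSE", word))  # trailing ellipsis as pause cue
    events.append(("SENTENCE_END", text))
    return events
```

Each tuple can then be routed to the visual layer independently, which is what lets a single caption line drive several simultaneous behaviours.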

Examples:

·         Per-word flashing synced to speech rhythm

·         Colour changes triggered by punctuation

·         Fade/hold behaviour during pauses

·         Emphasis-driven scaling or highlighting

This transforms captions from passive information into active visual control signals, enabling expressive, performance-ready visual systems.

This effectively converts language into a real-time visual control signal.
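The examples above reduce to a mapping from event type to style overrides. The table below is a hypothetical sketch of that mapping; the parameter names and values are illustrative, not the node's actual styling schema.

```python
# Base style and per-event overrides are illustrative values only.
BASE_STYLE = {"scale": 1.0, "opacity": 1.0, "colour": (255, 255, 255)}

EVENT_STYLES = {
    "WORD":        {"flash": True},                 # per-word flash
    "QUESTION":    {"colour": (80, 180, 255)},      # punctuation colour react
    "EXCLAMATION": {"colour": (255, 80, 80), "scale": 1.3},
    "PAUSE":       {"opacity": 0.4},                # fade/hold during pauses
    "EMPHASIS":    {"scale": 1.5},                  # emphasis-driven scaling
}

def modulate(event_type, style=None):
    """Return a new style dict with the event's overrides applied."""
    style = dict(BASE_STYLE if style is None else style)
    style.update(EVENT_STYLES.get(event_type, {}))
    return style
```

Because each event only overrides a few keys, unrelated style parameters pass through untouched, which keeps the output stable between events.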

Use Cases

·         Accessibility: Live captions for deaf/hard-of-hearing audiences at live events (A.EYE.ECHO integration)

·         VJ performance: Spoken word -> AI-generated reactive visuals in real time

·         Live events: Audience speech drives projected visuals

·         Art installations: Text-reactive generative art

1. System Overview        

Wallspace Captions is a real-time captioning and lyric visualisation system designed for large-format visual environments (e.g. LED walls, projection mapping, CRT installations). It converts live or captured audio into synchronised text and renders it visually as part of a VJ/visual performance pipeline.

Audio Input → Caption Engine → Text Processing → Visual Rendering (Scope)

Figure 4 - Wallspace Captions System Pipeline

This enables spoken word and music to be transformed into visual language in real time.

2. Data Flow Architecture

Audio Source (Mic / System / Stream)

→ Caption Engine (Whisper / Browser Speech / Shazam + LRCLIB)

→ Text Stream (Timestamped Segments)

→ Caption Logic Layer (Timing / Filtering / Formatting)

→ Scope Node (Rendering + Layout)

→ Visual Output (CRT / Projection / Screen)

Figure 5 - Captioning System Data Flow

3. Components

Audio Input Layer

Supports microphone, system audio capture, and web streams. Role: Provides real-time audio feed for transcription or identification.

Caption Engines

Whisper: Speech-to-text (local or remote). Browser Speech: Low-latency fallback transcription. Shazam + LRCLIB: Music identification and synced lyric retrieval. Role: Converts audio into structured text (speech or lyrics).

Caption Logic Layer

Handles timing alignment, latency compensation, and formatting. Includes dynamic sync offset and trim controls. Role: Ensures captions are synchronised and readable in real time.
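The sync-offset and trim behaviour described above can be sketched as a pure function over timestamped segments. The field names (`start`, `end`, `text`) are assumptions matching Whisper-style segment dicts, and the trim length is an illustrative default.

```python
def align_segments(segments, sync_offset=0.0, max_chars=120):
    """Apply a latency-compensation offset to timestamped segments and
    trim overlong text for readability. Field names are assumed to match
    Whisper-style segment dicts; max_chars is an illustrative default."""
    out = []
    for seg in segments:
        text = seg["text"].strip()
        if len(text) > max_chars:
            text = text[:max_chars - 1].rstrip() + "…"
        out.append({
            "start": max(0.0, seg["start"] + sync_offset),  # clamp at zero
            "end": max(0.0, seg["end"] + sync_offset),
            "text": text,
        })
    return out
```

A negative offset shifts captions earlier to compensate for transcription latency, which is the dynamic sync control exposed in the node.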

Scope (Rendering Engine)

Node-based visual system for rendering captions. Supports layout control, styling, and integration with other visual pipelines. Role: Final visual output layer.

4. Protocols & Data Handling

Audio Stream: Real-time input from system/mic.

Internal Data: Timestamped text segments (structured data).

SSE/WebSocket: Used for live updates within Scope.

Optional OSC/MIDI: Enables integration with external visual systems.
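For the SSE leg of the live-update path, one caption update is a small text frame. The `caption` event name and JSON payload shape below are assumptions; the framing itself (`event:`/`data:` lines terminated by a blank line) follows the Server-Sent Events specification.

```python
import json

def sse_event(payload, event="caption"):
    """Format one Server-Sent Events frame carrying a caption update.
    The 'caption' event name and payload shape are illustrative; the
    'event:'/'data:' framing follows the SSE spec."""
    return "event: {}\ndata: {}\n\n".format(event, json.dumps(payload))
```

A browser-side `EventSource` listener can then subscribe to `caption` events and feed them straight into the rendering layer.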

5. Example Data Flow

Audio Input → Whisper → Timestamped Text → Caption Logic → Rendered Caption

Speech mode: Figure 6 - Speech Mode: Real-Time Transcription Pipeline

In speech mode, live or recorded audio is processed through a transcription engine (e.g. Whisper), producing timestamped text segments. These are converted into structured caption events and passed into the rendering system, where they are displayed as synchronised visual captions in real time. 

Music mode:

Audio Input → Shazam → Track ID → LRCLIB → Synced Lyrics → Rendered Output

Figure 7 - Music Mode: Audio Identification to Synchronised Lyric Rendering

Incoming audio is analysed via fingerprinting (e.g. Shazam) to identify the track in real time. Once identified, synchronised lyric data is retrieved and aligned with playback timing, generating a continuous stream of timed text events. These events are rendered dynamically, enabling lyrics to function as both captions and visual performance elements.
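LRCLIB serves synced lyrics in the LRC format (`[mm:ss.xx] line`), so converting a lyric payload into the timed text events described above is a small parsing step. This sketch ignores LRC edge cases (multiple timestamps per line, extended metadata) and is not the system's actual implementation.

```python
import re

# One timestamp tag followed by the lyric text; metadata tags like
# [ar:Artist] have non-numeric fields and simply fail to match.
LRC_LINE = re.compile(r"\[(\d+):(\d+(?:\.\d+)?)\](.*)")

def parse_lrc(lrc_text):
    """Parse LRC-style synced lyrics into (seconds, text) events,
    sorted by time. Simplified sketch: multiple tags per line and
    extended metadata are ignored."""
    events = []
    for line in lrc_text.splitlines():
        m = LRC_LINE.match(line.strip())
        if not m:
            continue  # skip metadata and blank lines
        minutes, seconds, text = m.groups()
        events.append((int(minutes) * 60 + float(seconds), text.strip()))
    return sorted(events)
```

The resulting (time, text) stream can be fed to the same caption event layer as speech mode, which is what lets lyrics double as captions and performance elements.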

6. Current Mid-Cohort Status

✔ Multiple caption engines integrated

✔ Real-time transcription and lyric sync working

✔ Dynamic sync offset and trim controls implemented

✔ Rendering in Scope working

⚠ Visual styling and advanced layout in progress

⚠ External control integration (OSC/MIDI) exploration

7. Next Steps

1. Refine caption visual design for large-scale displays

2. Improve latency handling and sync accuracy

3. Add OSC/MIDI hooks for external control

4. Integrate with full visual performance pipeline

5. Record demo video

8. Relation to Other Work

This article represents the integration layer of four ongoing strands of work:

–         A real-time caption and lyric system (Wallspace Captions)

–         A gesture-based conducting system using Flowfal

–         A research-driven audio reactivity engine for TouchDesigner

–         A transcript-driven visual control system

Together, these form a unified approach to translating sound and movement into visual experience for deaf audiences.

See linked articles for deeper technical breakdowns of each subsystem.

9. Accessibility Tools

Accessibility Tooling (Supporting Layer)

During development, a number of supporting accessibility tools were created to address limitations in real-time communication platforms for deaf users.

These included system audio transcription pipelines, video OCR, and transcription processing tools.

These tools informed the design of the caption-driven visual system and are documented separately, with links at the top of this article.