Animus

Talk to the world around you.

What Animus Is

An AI-powered web app that identifies any physical object via camera, gives it a personality, and lets you have a full voice conversation with it as itself. Point the camera at your microphone and it speaks its opening line immediately, in its own voice. Speak back in any language and it responds in kind. Every object gets a distinct voice and speaking style matched to its personality. Live at animusai.app.

Also available as a native Android app for Meta Ray-Ban Display glasses — AnimusGlasses runs the same pipeline through the glasses' camera and speakers.

The Idea

What if the objects around you could speak? Not as assistants, not as tools — as themselves, with personality and perspective. It's an experiment in what becomes possible when AI understands the physical world well enough to give it a voice. The same pipeline — camera, vision AI, voice in, voice out — is what next-generation wearables and smart glasses will run. This is that, in a browser.

What I Built

  • React + Vite frontend with a custom cyberpunk AR aesthetic — scanlines, glitch text, neon cyan/magenta, chromatic aberration, animated particle network landing page
  • Gemini Vision API for object identification, personality generation, and voice selection — Gemini picks the voice and speaking style based on the object's perceived personality
  • Gemini Chat API for stateless in-character conversation — 3-sentence response format where sentences 1+2 are always under 155 chars (speakable), sentence 3 is free
  • Groq Whisper (whisper-large-v3-turbo) for STT with automatic language detection — hold to speak, release to send
  • Groq Orpheus TTS (canopylabs/orpheus-v1-english) — 6 voices matched to object personality, vocal direction tags (dramatic, calm, cheerful, etc.) prepended to control delivery
  • Multilingual voice pipeline — Whisper detects language, Gemini responds in that language, Web Speech API handles non-English TTS
  • Scan history — every terminated conversation saved client-side, resumable without re-scanning
  • Spatial markers — neon anchor dots placed at scan positions on the camera view, tap to resume
  • Object connection mode — scan a second object to put two objects in conversation with each other; user controls pace and can interject between turns
  • WebXR + Three.js integration — ARCore hit testing for real 3D spatial marker placement on Android Chrome; screen-coordinate fallback on all other devices
  • AR camera effects — tap ripple, recognition beat (object name flashes on scan complete)
  • Share feature — Canvas API generates a branded 1080×1080 card with the object's opening line; Web Share API on mobile, image download on desktop, X/Twitter intent
  • Ghost audio prevention — an isMounted ref guards all async audio play calls after panel unmount (see the sketch after this list)
  • Vercel serverless API routes: /api/gemini, /api/transcribe, /api/speak
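
The ghost audio guard is small enough to sketch here; the hook and function names (useAudioGuard, playWhenReady) are hypothetical stand-ins for the real useAudio internals:

// Sketch of the isMounted guard; names are illustrative, not the actual useAudio code.
import { useEffect, useRef } from 'react';

export function useAudioGuard() {
  const isMounted = useRef(true);

  useEffect(() => {
    isMounted.current = true;
    return () => { isMounted.current = false; };   // chat panel unmounted
  }, []);

  // fetchAudioBlob is any async TTS round trip, e.g. the /api/speak call
  async function playWhenReady(fetchAudioBlob) {
    const blob = await fetchAudioBlob();
    if (!isMounted.current) return;                // reply landed after [✕ END]: drop it, no ghost audio
    await new Audio(URL.createObjectURL(blob)).play();
  }

  return { playWhenReady };
}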

Architecture

The frontend is structured around a custom hook layer that keeps App.jsx as a lean layout shell (~110 lines) and isolates concerns cleanly.

src/
├── App.jsx                  # Layout/routing shell only
├── hooks/
│   ├── useAnimusState.js    # All domain state + handlers
│   ├── useGemini.js         # Vision, chat, debate API calls
│   ├── useWebXR.js          # AR session, hit testing, markers
│   ├── useAudio.js          # Orpheus TTS + Web Speech fallback
│   ├── useRecording.js      # Mic recording + Whisper STT
│   └── useShare.js          # Canvas card generation + share
├── components/              # Pure UI components
└── lib/
    └── three-scene.js       # ARSceneManager — WebXR + Three.js
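
Roughly how that layering composes in App.jsx (a hedged sketch; actual component and handler names differ, and CameraView in particular is illustrative):

// Composition sketch only; the real App.jsx wires more panels and effects.
import { useAnimusState } from './hooks/useAnimusState';
import { useGemini } from './hooks/useGemini';
import { useAudio } from './hooks/useAudio';
import { useRecording } from './hooks/useRecording';
import { CameraView, ChatPanel } from './components';

export default function App() {
  const gemini = useGemini();
  const audio = useAudio();
  const recording = useRecording();
  const animus = useAnimusState({ gemini, audio, recording });  // all domain state + handlers

  // App.jsx only decides which screen is visible; the logic lives in the hooks.
  return animus.activeObject
    ? <ChatPanel object={animus.activeObject} onEnd={animus.endConversation} />
    : <CameraView onScan={animus.scanObject} />;
}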

Request flow:

Landing page → [INITIATE] → camera view

Camera feed (browser MediaDevices API)
    ↓ screen tap or scan button
useAnimusState → TapRipple → captureFrame() → base64 JPEG
    ↓
/api/gemini — action: 'vision'
    Gemini → object_type, personality, opening_line, voice, vocal_direction
    ↓
ObjectCard → [INITIATE LINK] or [Share] or [Post on X]
    ↓
    ├─ Solo: ChatPanel
    │       useAudio → speakReply(opening_line) on mount
    │       useRecording → hold mic → /api/transcribe (Groq Whisper)
    │       /api/gemini chat → 3-sentence response (1+2 ≤155 chars)
    │       useAudio → speakReply(response, language, voice, direction)
    │           ├─ English: /api/speak → Orpheus wav → Audio()
    │           └─ Non-English / fallback: Web Speech API (BCP-47)
    │       [✕ END] → scan history + spatial marker
    │
    └─ Connect: DebatePanel (if scan history exists)
            Object A → /api/gemini debate → useAudio speakReply
            pause → user interjects (text or mic) → [CONTINUE →]
            Object B responds to A + user → speakReply → pause
            alternates until [✕ END]
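
The English/non-English split in speakReply comes down to a few lines. This sketch assumes Orpheus handles English only, as in the flow above; parameter names are illustrative:

// speakReply routing sketch; the real useAudio hook also handles errors and the isMounted guard.
async function speakReply(text, language, voice, direction) {
  if (language === 'en') {
    // English: Orpheus via the serverless route, returned as a wav blob
    const res = await fetch('/api/speak', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ text, voice, direction }),
    });
    await new Audio(URL.createObjectURL(await res.blob())).play();
  } else {
    // Non-English (or Orpheus failure): browser Web Speech API with a BCP-47 language tag
    const utterance = new SpeechSynthesisUtterance(text);
    utterance.lang = language;                      // e.g. 'es-ES', 'ja-JP'
    window.speechSynthesis.speak(utterance);
  }
}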

/api/speak — sanitize → first 2 sentences → [direction] tag
    → Orpheus ≤199 chars → wav stream
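
A sketch of that /api/speak preparation as a Vercel route handler; the sanitizing regex, the sentence split, and the Groq request shape are assumptions, shown only to make the sanitize → truncate → tag order concrete:

// /api/speak sketch; request/response field names and the Groq call are illustrative.
export default async function handler(req, res) {
  const { text, voice, direction } = req.body;

  const clean = text.replace(/[*_#`]/g, '');                           // strip markdown-ish noise
  const sentences = clean.match(/[^.!?]+[.!?]+/g) ?? [clean];
  const speakable = sentences.slice(0, 2).join(' ').trim();            // first 2 sentences only
  const tagged = `[${direction}] ${speakable}`.slice(0, 199);          // Orpheus input cap

  // Forward to Groq-hosted Orpheus (endpoint and body shape are not verified here)
  const groqRes = await fetch('https://api.groq.com/openai/v1/audio/speech', {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${process.env.GROQ_API_KEY}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      model: 'canopylabs/orpheus-v1-english',
      voice,
      input: tagged,
      response_format: 'wav',
    }),
  });

  res.setHeader('Content-Type', 'audio/wav');
  res.send(Buffer.from(await groqRes.arrayBuffer()));
}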

WebXR (Android Chrome + ARCore)
    useWebXR → requestSession('immersive-ar') → hit test source
    on scan: captureHitPosition() → THREE.Vector3
    on terminate: addARMarker() → 3D sphere in scene
    getMarkerScreenPositions() → SpatialMarkers DOM labels
    Fallback: clientX/clientY screen coordinates
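
The WebXR side, reduced to its essentials (a sketch; ARSceneManager does considerably more, and scene rendering is omitted for brevity):

// Hit-test sketch for Android Chrome + ARCore; function and return shape are illustrative.
import * as THREE from 'three';

export async function startARSession(renderer, overlayRoot) {
  const session = await navigator.xr.requestSession('immersive-ar', {
    requiredFeatures: ['hit-test'],
    optionalFeatures: ['dom-overlay'],
    domOverlay: { root: overlayRoot },               // keeps the React UI composited over the AR view
  });
  renderer.xr.enabled = true;
  await renderer.xr.setSession(session);

  const viewerSpace = await session.requestReferenceSpace('viewer');
  const localSpace = await session.requestReferenceSpace('local');
  const hitTestSource = await session.requestHitTestSource({ space: viewerSpace });

  let lastHit = null;                                // what captureHitPosition() hands back on scan
  renderer.setAnimationLoop((_, frame) => {
    const hits = frame ? frame.getHitTestResults(hitTestSource) : [];
    if (hits.length > 0) {
      const pose = hits[0].getPose(localSpace);
      if (pose) {
        const { position } = pose.transform;
        lastHit = new THREE.Vector3(position.x, position.y, position.z);
      }
    }
    // the real loop also renders the Three.js scene here
  });

  return { session, captureHitPosition: () => lastHit };
}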

Screenshots

Animus object card: object identified as a Studio Microphone

Animus chat interface: conversation with the microphone

Why It Matters

Animus demonstrates an end-to-end multimodal pipeline grounded in physical objects: vision understanding, language-aware LLM responses, neural TTS with a personality-matched voice, and spatial anchoring via WebXR. The object connection mode shows multi-agent LLM coordination — two separate Gemini sessions maintaining their own conversation history and responding to each other in real time.
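
The connect-mode bookkeeping is simple in principle. A sketch, with a hypothetical callGemini wrapper and field names that are assumptions:

// Two independent histories, one per object; each turn appends only to the current speaker's history.
const historyA = [];
const historyB = [];

async function callGemini(body) {
  const res = await fetch('/api/gemini', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(body),
  });
  return (await res.json()).reply;                   // response field name is an assumption
}

async function debateTurn(speaker, lastLineFromOther, userInterjection) {
  const history = speaker === 'A' ? historyA : historyB;
  const prompt = userInterjection
    ? `${lastLineFromOther}\n[User interjects]: ${userInterjection}`
    : lastLineFromOther;
  const reply = await callGemini({ action: 'debate', history, prompt });
  history.push({ user: prompt, model: reply });      // then useAudio speaks it in this object's voice
  return reply;
}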

The AR layer — hit testing, spatial markers, dom-overlay compositing — exercises the same primitives that production ARCore and HoloLens systems build on. The glasses version (AnimusGlasses) takes the same pipeline to real hardware: Meta Ray-Ban Display glasses.

Tech Summary

Technologies: React, Vite, Tailwind CSS, Gemini Vision API, Gemini Chat API, Groq Whisper STT, Groq Orpheus TTS, Three.js, WebXR, ARCore, Vercel, JavaScript