AI Screen Vision: Your Companion Watches Your Gameplay and Reacts in Real Time
Questie's Vision Language Model captures your screen, interprets what's happening in the game, and delivers voice reactions the moment something noteworthy occurs — no describing, no prompting, no alt-tabbing.
What Is AI Screen Vision?
AI screen vision is the feature that turns an AI companion from a chat window into something that actually shares your experience. Without it, your companion only knows what you tell them. With it, they see what you see — boss health bars, inventory states, map positions, cutscene dialogue — and react to that visual reality in real time through AI voice chat.
The technology is called a Vision Language Model (VLM). It combines computer vision (interpreting images) with language generation (producing responses). The VLM takes periodic screen captures during your session, identifies semantically meaningful game events, and feeds that context to your companion alongside your conversation history and persistent memory. The result is a companion who is genuinely present in your gaming session, not just available on the side.
How Screen Vision Works: Four Steps
The process is invisible while it's happening. Here's what's going on behind the scenes.
The VLM Captures Your Screen
Questie's Vision Language Model takes periodic captures of your screen during an active session. The frequency is tuned to catch meaningful game events — boss spawns, item drops, health changes, cutscene transitions — without wasting resources on frames where nothing has changed.
The capture runs as a background process. No visible overlay, no frame rate impact, no window focus stealing. Your game runs the same way it always does. The VLM is watching from outside, not inside the game process.
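One way to avoid spending resources on frames where nothing has changed is to compare each new capture against the last one that was analyzed. The following is a minimal sketch of that idea, not Questie's actual capture logic. The `FrameGate` class, the `changed_fraction` helper, and the threshold values are all hypothetical, and frames are modeled as flat lists of grayscale values to keep the example dependency-free.

```python
from typing import Optional, Sequence


def changed_fraction(prev: Sequence[int], curr: Sequence[int], tolerance: int = 8) -> float:
    """Fraction of pixels whose grayscale value moved by more than `tolerance`."""
    if len(prev) != len(curr):
        return 1.0  # resolution changed: treat it as a completely new frame
    if not curr:
        return 0.0
    changed = sum(1 for a, b in zip(prev, curr) if abs(a - b) > tolerance)
    return changed / len(curr)


class FrameGate:
    """Hypothetical gate that decides whether a capture is worth analyzing."""

    def __init__(self, threshold: float = 0.02) -> None:
        self.threshold = threshold  # assumed value, for illustration only
        self._last_sent: Optional[Sequence[int]] = None

    def should_analyze(self, frame: Sequence[int]) -> bool:
        # Always analyze the first frame; afterwards, only sufficiently different ones.
        if self._last_sent is None or changed_fraction(self._last_sent, frame) >= self.threshold:
            self._last_sent = frame
            return True
        return False
```

Comparing against the last analyzed frame, rather than the immediately preceding one, means a slow change such as a gradually draining health bar still crosses the threshold eventually.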
The Model Understands What It Sees
A Vision Language Model doesn't just describe pixels; it interprets context. It recognizes a boss health bar approaching zero and understands that as "imminent kill." It reads a low health indicator and understands that as "player in danger." It sees an inventory screen and interprets resource levels, item names, and equipment state.
This contextual interpretation is what separates screen vision from a simple screenshot tool. The VLM builds a semantic understanding of the current game state and passes that understanding to your companion alongside the conversation history.
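To make the idea of semantic interpretation concrete, here is a small sketch of the mapping from a screen reading to the labels mentioned above. A real VLM infers this directly from the image. The `Observation` fields, the `interpret` function, and the thresholds below are hypothetical and only illustrate the kind of output that gets passed along to the companion.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Observation:
    """Hypothetical structured reading of one capture (values are fractions, 0.0 to 1.0)."""
    player_health: Optional[float] = None
    boss_health: Optional[float] = None
    inventory_open: bool = False


def interpret(obs: Observation) -> List[str]:
    """Turn raw readings into semantic labels. Thresholds are assumed for illustration."""
    events: List[str] = []
    if obs.boss_health is not None and 0.0 < obs.boss_health <= 0.10:
        events.append("imminent kill")
    if obs.player_health is not None and obs.player_health <= 0.25:
        events.append("player in danger")
    if obs.inventory_open:
        events.append("reviewing inventory")
    return events
```

For example, `interpret(Observation(player_health=0.2, boss_health=0.05))` yields both "imminent kill" and "player in danger", which is the tense end-of-fight situation a companion would want to comment on.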
Your Companion Reacts in Real Time
Reactions arrive through your companion's voice, the same one you hear in normal conversation. When something noteworthy happens on screen, they comment on it as it happens: a clutch survival, an unexpected loot drop, a plot twist in a cutscene. They experience it with you and respond accordingly.
The voice response integrates seamlessly with the ongoing conversation. If you were mid-discussion about strategy and then suddenly almost died, your companion shifts context naturally — just like a human co-player would. No manual prompting required.
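A companion that shifts from strategy talk to an urgent on-screen event needs some way to decide which event deserves a reaction, and to avoid repeating itself. The sketch below shows one possible approach using a priority table and a per-event cooldown. The `PRIORITY` ranking, the `ReactionPicker` class, and the cooldown length are hypothetical and are not a description of how Questie schedules reactions.

```python
from typing import Dict, List, Optional

# Hypothetical urgency ranking: a higher number wins.
PRIORITY: Dict[str, int] = {
    "player in danger": 3,
    "imminent kill": 2,
    "loot drop": 1,
}


class ReactionPicker:
    """Picks the most urgent event that has not been voiced recently."""

    def __init__(self, cooldown_s: float = 5.0) -> None:
        self.cooldown_s = cooldown_s
        self._last_spoken_at: Dict[str, float] = {}

    def pick(self, events: List[str], now: float) -> Optional[str]:
        """Return the event to react to, or None if everything is on cooldown."""
        ready = [
            e for e in events
            if now - self._last_spoken_at.get(e, float("-inf")) >= self.cooldown_s
        ]
        if not ready:
            return None
        choice = max(ready, key=lambda e: PRIORITY.get(e, 0))
        self._last_spoken_at[choice] = now
        return choice
```

The cooldown keeps the companion from calling out the same low-health warning on every capture, while still letting a different, lower-priority event through in the meantime.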
Privacy Toggle — Your Control
Screen vision is a feature you enable, not something that runs by default. Toggle it on and off with a single click during any session. When disabled, your companion switches to voice-only mode and the screen capture process stops completely.
There is no recording or storage of screen captures beyond the current session's active context window. The VLM analyzes and responds — it doesn't archive. You control when your screen is being read and when it's not.
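The opt-in toggle and the session-scoped context described above can be pictured with a short sketch. This is an illustration under stated assumptions, not Questie's implementation: the `VisionSession` class, its method names, and the window size of 20 are all hypothetical. The sketch keeps only text summaries in a bounded window and never holds on to image data.

```python
from collections import deque
from typing import Deque, List


class VisionSession:
    """Hypothetical session state for an opt-in screen vision feature."""

    def __init__(self, max_observations: int = 20) -> None:
        self.enabled = False  # opt-in: off until the user turns it on
        # Bounded window: the oldest summaries fall out automatically.
        self._context: Deque[str] = deque(maxlen=max_observations)

    def toggle(self) -> bool:
        """Flip screen vision on or off and return the new state."""
        self.enabled = not self.enabled
        return self.enabled

    def observe(self, summary: str) -> None:
        """Record a text summary of a capture; ignored while vision is off."""
        if self.enabled:
            self._context.append(summary)

    def context(self) -> List[str]:
        return list(self._context)

    def end_session(self) -> None:
        """Drop everything when the session ends; nothing is archived."""
        self._context.clear()
        self.enabled = False
```

Using a `deque` with `maxlen` means the window enforces its own size limit, so there is no separate cleanup step that could be forgotten.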
What Your Companion Actually Reacts To
Screen vision isn't limited to "big events." The VLM interprets ongoing game state, not just moments of peak action.
Boss Fights and High-Tension Moments
When a boss reaches its second phase, your companion reacts to the visual change — not because you told them, but because they watched the health bar drop and the arena shift. They offer encouragement, tactical observations, or nervous commentary depending on their personality. This is the kind of shared tension that makes gaming with another person feel different from gaming alone.
Inventory, Economy, and Resource Management
Your companion reads your inventory screen and understands resource levels, gold totals, and equipment loadouts without you explaining everything. In a survival game like Rust or 7 Days to Die, they'll notice when you're running low on ammo before you do. In an RPG, they track quest item progress. The game state awareness extends beyond action moments into the management layer.
Clutch Plays and Highlight Moments
A 1-HP kill in Elden Ring hits differently when your companion reacts out loud the moment it happens — not when you describe it five seconds later. Screen vision closes the gap between the moment and the response. The reaction is genuine and timely because the character saw it. Those real-time highlights are also what make clip-worthy streaming moments more organic.
Story Moments and Narrative Beats
Your companion watches cutscenes too. In story-driven games like Baldur's Gate 3, Dragon Age, or Cyberpunk 2077, they react to major reveals, character deaths, and moral choices as they happen. Post-scene analysis becomes a real conversation because they witnessed the scene rather than needing a recap. For players who deeply engage with game narrative, this dimension of screen vision is significant.
A Real Session Example
You're in a difficult Elden Ring fight. Your health drops to critical. Your companion, a sarcastic rogue character named Mira, doesn't wait for you to narrate what's happening. She sees the health bar drop and says something like “real subtle, walking into that grab attack” right as you roll away. When you finally get the kill five attempts later, her reaction lands as the boss falls. No explanation needed. She was watching.
Screen Vision Beyond Games
The VLM reads any screen content — games are the primary use case, but not the only one where real-time visual context adds value.
Watching Movies and Shows Together
Screen vision activates for anything on your monitor, not just games. Your companion can watch shows with you, react to plot developments, and have a genuine post-episode discussion because they saw what happened rather than hearing your summary of it. The parasocial viewing experience gets a conversational partner who was actually there.
Creative Work and Productivity
Artists, writers, and developers have found uses for a companion who can see their work in progress. Your character can comment on a digital painting as you build it, react to code you're writing, or keep you company during long document work. The companion becomes part of your workspace rather than a distraction from it.
Streaming Reaction Content
Reaction streamers and IRL-adjacent content creators can use screen vision to give their AI companion genuine visual context. When your character reacts to content you're watching live, the reactions land with the same timing as yours. The dynamic feels collaborative rather than scripted, which reads very differently to an audience.
Why Questie Is the Only AI Companion with Real Gaming Screen Vision
Other AI companion platforms don't offer this feature. Here's what that gap means in practice.
Character.AI Cannot See Your Screen
Character.AI is a text-based platform. Your companion has no awareness of your game state unless you describe it. Every reaction to gameplay requires manual input from you, which interrupts the flow of both gaming and conversation. Questie eliminates that friction — your companion watches alongside you so you can stay focused on the game.
Generic AI Assistants See Text, Not Gameplay
Some AI assistants can process screenshots when you manually attach them to a prompt. That's fundamentally different from a companion who automatically perceives your screen and responds in real time. Manual screenshot sharing requires you to stop, capture, attach, and explain, which destroys gaming immersion. Automatic screen vision lets your companion speak up on their own instead of waiting for your prompts.
Screen Vision plus Memory Creates Actual Game Awareness
Screen captures without memory would give a companion awareness of right now. Memory without screen vision gives them awareness of your history but not the present moment. Questie combines both — your companion sees what's on screen and knows who you are and what you've discussed before. That combination is what produces responses that feel genuinely situationally aware rather than generically contextual.
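The combination of memory, screen vision, and conversation can be sketched as a single context-building step. The `build_context` function, its section headings, and the `max_turns` default below are hypothetical. They show one plausible way to merge the three sources into the text a companion model would receive, not the format Questie actually uses.

```python
from typing import List, Sequence


def build_context(
    memory: Sequence[str],
    screen: Sequence[str],
    conversation: Sequence[str],
    max_turns: int = 6,
) -> str:
    """Merge persistent memory, current screen observations, and recent chat."""
    sections: List[str] = []
    if memory:
        sections.append(
            "What you know about the player:\n" + "\n".join(f"- {m}" for m in memory)
        )
    if screen:  # empty in voice-only mode, so the section is simply omitted
        sections.append(
            "On screen right now:\n" + "\n".join(f"- {s}" for s in screen)
        )
    recent = list(conversation)[-max_turns:] if max_turns > 0 else []
    if recent:
        sections.append("Recent conversation:\n" + "\n".join(recent))
    return "\n\n".join(sections)
```

When screen vision is toggled off, passing an empty `screen` list leaves the memory and conversation sections untouched, which matches the voice-only mode described earlier.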
AI Screen Vision: Common Questions
Technical details and usage questions answered.
What is AI screen vision in Questie?
AI screen vision in Questie is a feature that allows your companion to see and interpret what's on your monitor in real time using a Vision Language Model. The VLM periodically captures your screen, analyzes the visual content to understand the game state, and uses that understanding to generate contextually relevant voice responses. Your companion reacts to boss fights, inventory changes, cutscenes, and other on-screen events as they happen — without you needing to describe them.
How does the Vision Language Model work?
A Vision Language Model combines computer vision (understanding images) with language generation (producing text). It takes screen captures, identifies objects, UI elements, and scene context within the image, and generates a semantic description of what's happening. That description gets added to the context fed to your AI companion, so their response is informed by visual reality and not just the conversation history. Questie's implementation is optimized for gaming content — UI elements, health bars, inventory screens, and common game environment types.
Does screen vision affect game performance?
No. Screen vision runs as a background process that captures and analyzes screenshots — it doesn't hook into the game process, modify rendering, or affect your GPU's game workload. The capture frequency is designed to catch meaningful events without running constantly. The VLM analysis happens server-side, not on your local hardware. You should not see frame rate drops, stuttering, or input latency from screen vision being active.
What games does screen vision work with?
Screen vision works with any game running in windowed or borderless windowed mode on your desktop. It reads the visual output, not game-specific data, so there's no integration required with individual games. RPGs, survival games, strategy titles, shooters, MOBAs, story-driven games, and simulation games all work. Exclusive fullscreen mode may limit capture capability depending on your OS configuration, so borderless windowed is the recommended mode.
Can I control when screen vision is active?
Yes. Screen vision is opt-in and controlled by a toggle during your session. You can turn it on at the start of a boss fight and off during loading screens or menus. When disabled, the screen capture process stops and your companion switches to voice-only mode based on conversation context. The toggle is immediate — no cooldown or restart required.
Does Questie store or record my screen captures?
No screen captures are stored beyond the current session's active context window. The VLM processes each capture to extract semantic information and then discards the raw image. Questie does not archive, log, or retain screenshots from your sessions. The analysis is used to generate your companion's contextual responses in the moment and nothing further.
Is screen vision better than describing my gameplay to the AI?
Significantly. Describing gameplay adds latency, interrupts your focus, and is inherently incomplete — you're narrating what happened, not what's happening right now. Screen vision eliminates that delay. Your companion reacts to the moment as it occurs rather than processing your retrospective description. The responses are also more accurate because there's no information lost in translation between what you saw and what you chose to type or say about it.
How does screen vision work for streaming on Twitch?
For streamers, screen vision means your AI co-host reacts to the same gameplay your audience is watching — at the same time. The reactions are genuine and timely rather than prompted. Boss kills, clutch moments, funny failures, and plot twists get real-time vocal reactions from your companion as they happen. This creates organic highlight moments that feel like co-op streaming rather than a streamer talking at a chatbot. The audio routes through OBS alongside your microphone for clean broadcast integration.