Case Study 2: Seeing AI by Microsoft — Computer Vision as Accessibility


Introduction

In the chapters of this textbook, computer vision appears most often as a tool for efficiency, optimization, and revenue. It detects out-of-stock products, catches manufacturing defects, and screens medical images. These are important applications. But they do not capture the full scope of what becomes possible when machines learn to see.

Seeing AI, a free smartphone application developed by Microsoft, uses computer vision to describe the visual world to people who are blind or have low vision. Point the phone's camera at a person, and it speaks their name (if known), estimated age, and emotional expression. Point it at a product in a store, and it reads the barcode and identifies the item. Point it at a printed page, and it reads the text aloud. Point it at a restaurant menu, a street sign, a handwritten note, or a scene in a park, and it describes what it sees.

Seeing AI is not Microsoft's most commercially significant AI product. It generates no direct revenue. But it may be Microsoft's most important AI product — because it demonstrates that computer vision's value cannot be measured solely in dollars.

This case study examines Seeing AI as a technology achievement, a design challenge, an accessibility innovation, and a lens through which to understand computer vision's broader potential for social impact.


The Problem: Navigating a Visual World Without Vision

Approximately 2.2 billion people worldwide have some form of vision impairment, according to the World Health Organization. Of these, approximately 43 million are blind and 295 million have moderate to severe visual impairment. In the United States alone, approximately 7.6 million adults report significant vision loss.

For people who are blind or have low vision, the physical world is full of information that is designed exclusively for sighted people:

  • Printed text — menus, signs, instructions, mail, labels, expiration dates
  • Product identification — distinguishing between similar containers (shampoo vs. conditioner, sugar vs. salt) in a kitchen or store
  • Currency — US bills are identical in size and texture regardless of denomination
  • Social cues — facial expressions, body language, who is in the room, who is speaking
  • Navigation — street signs, building numbers, landmarks
  • Documents — contracts, forms, handwritten notes

Before AI-powered tools, people with visual impairments relied on a combination of human assistance (sighted guides, phone-based services like Be My Eyes where volunteers describe what a camera shows), assistive devices (screen readers, magnification tools, Braille displays), and memorization strategies (organizing products by position, folding bills differently by denomination).

These strategies work, but they impose a persistent cognitive burden and dependency on others. The question Saqib Shaikh, the project's creator, asked was simple and profound: What if your phone could see for you?


The Creator: Saqib Shaikh

Seeing AI's story begins with its creator — a detail that matters because it illustrates a principle the chapter identifies: the most impactful AI applications often come from people who deeply understand the problem they are solving.

Saqib Shaikh is a software engineer at Microsoft who lost his sight at age seven. He attended a school for the blind, learned to code using a screen reader, earned a computer science degree, and joined Microsoft in 2006. For over a decade, he worked on various AI and cloud computing projects while personally experiencing the daily challenges that visual impairment creates.

Shaikh first demonstrated a prototype of Seeing AI at Microsoft's Build developer conference in 2016. The demo was simple: he pointed his phone at a colleague, and the app said, "I think it's a man about 40 years old. He looks happy." The audience reaction was immediate and emotional. The project received internal support from Microsoft's CEO Satya Nadella, who had made accessibility a strategic priority.

Seeing AI launched as a free iOS app in July 2017 and has since been downloaded millions of times. Shaikh continues to lead development while also working on broader accessibility initiatives at Microsoft.

Business Insight: Seeing AI was created by someone who lives with the problem it solves. This is not a coincidence. User-centric AI development — building with the affected community, not just for them — produces better products. Shaikh did not need to conduct user research to understand the pain points of visual impairment. He experienced them every day. The lesson for business leaders: the most transformative AI applications often emerge when the people building the technology are also the people who need it.


The Technology: Seven Channels of Sight

Seeing AI organizes its capabilities into distinct "channels," each using different computer vision techniques to address a specific need:

Channel 1: Short Text

The camera continuously scans for text in the environment — signs, labels, buttons on an elevator, the name on a coffee cup. When text is detected, the app reads it aloud immediately, without requiring the user to take a photograph.

CV technology: Optical Character Recognition (OCR) operating in real-time on a video stream. The app uses text detection to locate text regions, then OCR to decode the characters, then text-to-speech to vocalize the result. The challenge is speed and accuracy across diverse fonts, sizes, orientations, and lighting conditions.
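The detect-decode-speak loop can be sketched in a few lines. Everything here is illustrative: the app's internals are not public, so `detect_text` stands in for a real text detection plus OCR model, and the `spoken` list stands in for text-to-speech output. The sketch does capture one real design requirement: a continuously running channel must debounce its output, or it would re-read the same sign on every video frame.

```python
import time

def detect_text(frame):
    # Placeholder for text detection + OCR on a single video frame.
    # A real implementation would run a detection model here.
    return frame.get("text")

class ShortTextChannel:
    """Continuously reads detected text aloud, debouncing repeats."""

    def __init__(self, repeat_after_s=5.0):
        self.repeat_after_s = repeat_after_s   # how long before re-reading
        self._last_spoken = {}                 # text -> timestamp last spoken
        self.spoken = []                       # stand-in for TTS output

    def process_frame(self, frame, now=None):
        now = time.monotonic() if now is None else now
        text = detect_text(frame)
        if not text:
            return
        last = self._last_spoken.get(text)
        if last is None or now - last >= self.repeat_after_s:
            self.spoken.append(text)           # real app: speak via TTS here
            self._last_spoken[text] = now

ch = ShortTextChannel()
ch.process_frame({"text": "EXIT"}, now=0.0)   # spoken
ch.process_frame({"text": "EXIT"}, now=1.0)   # debounced, not repeated
ch.process_frame({"text": "EXIT"}, now=6.0)   # re-spoken after 5 seconds
```

The debounce window is a usability trade-off: too short and the app chatters, too long and the user misses text that reappears.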

Channel 2: Documents

When pointed at a printed page, the app captures the entire document, guides the user to position the camera correctly (using audio cues like "move right," "move down"), and reads the full text with proper formatting and paragraph structure.

CV technology: Document layout analysis + OCR. The system must distinguish between text columns, headers, captions, and body text, then read them in the correct order. This is more complex than short text detection because document structure matters — reading a two-column newspaper page requires understanding the column layout.
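The column-ordering problem can be illustrated with a deliberately simplified sketch. Assume the OCR stage has already returned text blocks with coordinates, and split a two-column page at its midline; real layout analysis handles arbitrary layouts, headers, and captions, which this toy function does not.

```python
def reading_order(blocks, page_width):
    """Order OCR text blocks for a simple two-column page:
    left column top-to-bottom, then right column.
    Each block is (x, y, text), with x, y the block's top-left corner."""
    mid = page_width / 2
    left = sorted([b for b in blocks if b[0] < mid], key=lambda b: b[1])
    right = sorted([b for b in blocks if b[0] >= mid], key=lambda b: b[1])
    return [b[2] for b in left + right]

# Blocks arrive in arbitrary order from the OCR stage.
blocks = [(320, 40, "col2-para1"), (20, 40, "col1-para1"),
          (20, 200, "col1-para2"), (320, 200, "col2-para2")]
print(reading_order(blocks, page_width=600))
# ['col1-para1', 'col1-para2', 'col2-para1', 'col2-para2']
```

Sorting blocks purely top-to-bottom would interleave the two columns and read the page as nonsense, which is exactly the failure the channel must avoid.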

Channel 3: Products

The user scans product barcodes by sweeping the phone camera across items. The app reads the barcode and looks up the product name, brand, and other details from a product database. Audio tones guide the user to center the barcode in the camera frame.

CV technology: Barcode detection and decoding. While technically simpler than image classification, the UX challenge is significant — the user cannot see the barcode, so the app must provide spatial audio guidance ("move the phone left and down") to help them locate it.
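The guidance logic can be sketched as a comparison between the detected barcode's center and the frame center, translated into a spoken cue. The function below is a hypothetical illustration, not Seeing AI's actual implementation; the tolerance value is arbitrary.

```python
def aim_guidance(bbox_center, frame_size, tolerance=0.1):
    """Turn the offset between a detected barcode's center and the
    camera frame's center into a spoken cue like 'move right and down'."""
    cx, cy = bbox_center
    w, h = frame_size
    dx = (cx - w / 2) / w    # > 0: barcode is right of center
    dy = (cy - h / 2) / h    # > 0: barcode is below center
    parts = []
    if dx > tolerance:
        parts.append("right")
    elif dx < -tolerance:
        parts.append("left")
    if dy > tolerance:
        parts.append("down")
    elif dy < -tolerance:
        parts.append("up")
    return "hold steady" if not parts else "move " + " and ".join(parts)

print(aim_guidance((500, 240), (640, 480)))   # barcode right of center
print(aim_guidance((320, 240), (640, 480)))   # barcode centered
```

In the real app these cues are delivered as audio tones rather than words, but the underlying geometry is the same.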

Channel 4: People

The app can detect faces in the camera's field of view and, if the user has trained it on photos of friends, family, and colleagues, identify them by name. For unrecognized faces, it provides descriptions: estimated age, gender presentation, and emotional expression.

CV technology: Face detection + face recognition (for trained faces) + attribute estimation (age, expression). Microsoft uses its Azure Face API, which has been the subject of significant ethical debate. In 2020, Microsoft announced it would no longer sell facial recognition technology to police departments, and in 2022 it retired the age, gender, and emotion estimation features from the general-purpose Face API, though they remain available in Seeing AI as an accessibility accommodation.

Channel 5: Currency

The app identifies US paper currency denominations — a critical need because US bills are the same size regardless of value, unlike the currencies of many other countries that vary in size and color by denomination.

CV technology: Image classification trained specifically on US currency denominations. The model must work under varying lighting conditions and with bills in different states of wear.

Channel 6: Scenes

When the user takes a photograph of their environment, the app generates a natural language description of the scene. "A park with a path and trees. Two people sitting on a bench. A dog on the grass."

CV technology: This is the most technically sophisticated channel, combining object detection (identifying entities in the scene), spatial relationship understanding (the bench is near the path), and natural language generation (producing a coherent description). Recent versions leverage multimodal AI models — the same type of models discussed in Chapter 18 — to generate richer, more contextual descriptions.
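A crude, template-based version of the final step, turning detections into a sentence, looks like the sketch below. The production app uses learned captioning and multimodal models rather than templates; this only illustrates the composition step, with made-up labels.

```python
def describe_scene(detections):
    """Compose one plain-English sentence from object detections.
    detections: list of (label, count) pairs, e.g. [('tree', 3)].
    A deliberately crude stand-in for learned captioning models."""
    def phrase(label, count):
        return f"one {label}" if count == 1 else f"{count} {label}s"

    parts = [phrase(label, count) for label, count in detections]
    if not parts:
        return "I don't see anything I recognize."
    if len(parts) == 1:
        return f"I see {parts[0]}."
    return "I see " + ", ".join(parts[:-1]) + " and " + parts[-1] + "."

print(describe_scene([("tree", 3), ("bench", 1)]))
# I see 3 trees and one bench.
```

The gap between this template and "Two people sitting on a bench" shows why the channel needs models that capture spatial relationships and activities, not just object counts.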

Channel 7: Color

The app identifies the dominant color in the camera's field of view, useful for selecting clothing, identifying objects by color, or navigating color-coded systems.

CV technology: Color analysis and classification, mapping pixel color values to named colors.
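This channel is simple enough to sketch almost end to end: pick the nearest entry in a named-color table. The palette below is illustrative, not the app's actual one, and production systems typically measure distance in a perceptual color space rather than raw RGB, which this sketch ignores for brevity.

```python
# Illustrative named-color table; Seeing AI's actual palette is unknown.
NAMED_COLORS = {
    "black": (0, 0, 0),
    "white": (255, 255, 255),
    "red": (220, 20, 60),
    "green": (34, 139, 34),
    "blue": (30, 100, 220),
    "yellow": (240, 220, 40),
    "gray": (128, 128, 128),
    "brown": (139, 90, 43),
}

def name_color(rgb):
    """Return the named color nearest to rgb by squared RGB distance."""
    def dist(candidate):
        return sum((a - b) ** 2 for a, b in zip(rgb, candidate))
    return min(NAMED_COLORS, key=lambda name: dist(NAMED_COLORS[name]))

print(name_color((200, 30, 50)))   # nearest table entry is "red"
```

A production version would also average over a region of pixels rather than trusting a single sample, since camera noise makes individual pixels unreliable.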

Research Note: Seeing AI's channel architecture is a masterclass in product design for AI applications. Rather than presenting computer vision as a monolithic capability, the app breaks it into specific, task-oriented functions that map to real user needs. Each channel has a defined purpose, predictable behavior, and clear limitations. This task decomposition is a model for any business deploying AI: do not build "an AI system" — build specific capabilities that solve specific problems.


Design Principles: Building for Blind Users

Seeing AI's design challenges go far beyond the computer vision algorithms. The app must be usable by people who cannot see the screen — a constraint that fundamentally reshapes every design decision.

Audio-First Interface

Every interaction in Seeing AI is mediated through audio. There is no visual UI that matters. The app uses:

  • VoiceOver (Apple's built-in screen reader) for navigation between channels and settings
  • Spoken descriptions for all CV outputs
  • Audio tones for spatial guidance (helping users aim the camera at targets they cannot see)
  • Haptic feedback for confirmations and alerts

This audio-first design is not an adaptation of a visual interface. It is a natively audio interface designed from the ground up. The distinction matters: many "accessible" apps are visual apps with accessibility features bolted on. Seeing AI is an accessibility app with no visual dependency.

Graceful Uncertainty

Computer vision models produce probabilistic outputs — confidence scores that indicate how certain the model is about its prediction. For a sighted user viewing results on a screen, a confidence score of 0.72 can be displayed alongside the result, and the user can visually verify whether the prediction seems reasonable.

For a blind user, this is not possible. The app must decide when to speak with confidence ("This is a can of Coca-Cola") and when to express uncertainty ("I think this might be a can of soda"). Getting this calibration right is critical: overconfident statements about low-confidence predictions could lead users to make incorrect decisions (misidentifying a product, misreading medication instructions), while excessive hedging reduces the app's utility and the user's trust.
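The calibration decision can be made concrete as a thresholding policy over the model's confidence score. The thresholds and phrasings below are purely illustrative, not Seeing AI's actual values; in practice they would be tuned per channel and validated with users.

```python
def speak_with_confidence(label, confidence, high=0.85, low=0.55):
    """Map a classifier's confidence score to hedged spoken phrasing.
    Thresholds are illustrative, not Seeing AI's actual calibration."""
    if confidence >= high:
        return f"This is {label}."
    if confidence >= low:
        return f"This might be {label}."
    return "I can't tell what this is. Try moving closer or improving the light."

print(speak_with_confidence("a can of soda", 0.72))
print(speak_with_confidence("a can of Coca-Cola", 0.93))
```

Note that the low-confidence branch does not merely hedge; it tells the user what to do next, converting uncertainty into an actionable instruction.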

Business Insight: Graceful uncertainty management is relevant far beyond accessibility applications. Any AI system that communicates its outputs to non-expert users — customer service chatbots, diagnostic tools, recommendation engines — must calibrate how it expresses confidence. "The model says this with 73% confidence" is meaningful to a data scientist and meaningless to everyone else. Designing for appropriate certainty expression is a UX discipline that most AI products underinvest in.

Speed and Reliability

For Seeing AI to be useful in daily life, it must work quickly and reliably in uncontrolled environments — kitchens, stores, offices, streets, restaurants. This means:

  • The app must handle variable lighting (bright sunlight, dim interiors, fluorescent overhead)
  • Processing must be fast enough for real-time use (the short text channel operates continuously)
  • The app must work with one hand (the user may be holding a cane, carrying groceries, or navigating an unfamiliar space)
  • Battery consumption must be manageable (always-on camera processing drains batteries quickly)

These constraints drove Microsoft toward a combination of on-device processing (for speed and privacy) and cloud processing (for more complex tasks like scene description), anticipating the hybrid edge-cloud architecture discussed in the chapter.
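A hybrid architecture of this kind reduces, at its simplest, to a per-task routing policy: run locally when speed or privacy demands it, call the cloud for heavyweight tasks. The task names and assignments below are a hypothetical sketch, not Microsoft's actual design.

```python
# Hypothetical routing policy: which tasks run where. The split shown
# here is illustrative; the real app's allocation is not public.
ON_DEVICE = {"short_text", "barcode", "currency", "color"}
CLOUD = {"document", "scene_description"}

def route(task, network_available=True):
    """Decide where to run a task under a simple edge-cloud policy."""
    if task in ON_DEVICE:
        return "on-device"                 # fast, private, works offline
    if task in CLOUD and network_available:
        return "cloud"                     # heavier models, needs network
    return "unavailable offline"

print(route("barcode"))
print(route("scene_description", network_available=False))
```

The policy also makes the failure mode explicit: when the network drops, on-device channels keep working while cloud-dependent ones degrade gracefully instead of hanging.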


Impact: What Changes When the Phone Can See

The impact of Seeing AI extends beyond the practical capabilities of its individual channels. The app changes the relationship between a visually impaired person and their physical environment in ways that are difficult to quantify but impossible to overstate.

Independence

A survey of Seeing AI users conducted by Microsoft found that the most frequently cited benefit was not any single feature but the reduction in dependency on sighted assistance. Users reported:

  • Reading their own mail without asking a family member
  • Identifying products in a grocery store without requesting help from store staff
  • Navigating unfamiliar buildings by reading signs
  • Participating in meetings where visual presentations were shared
  • Choosing clothing without assistance

Each of these represents a moment where a person can act independently rather than waiting for or requesting help. The cumulative effect on dignity, confidence, and quality of life is substantial.

Social Interaction

The people recognition channel changes social dynamics for blind users. Knowing that a colleague has entered the room, recognizing a friend on the street, understanding that the person you are speaking with appears confused or amused — these are social cues that sighted people process unconsciously and that blind individuals must navigate without.

"Before Seeing AI, walking into a crowded room was disorienting," one user told Microsoft in a testimonial. "I didn't know who was there until they spoke to me. Now I can point my phone around the room and know that my boss is near the window and my friend is by the coffee machine. It changes the social equation completely."

Employment

Visual impairment is correlated with significantly lower employment rates. In the US, approximately 44% of working-age adults who are blind are employed, compared to 77% of the general population. While many factors contribute to this gap, the inability to access visual information in the workplace — documents, presentations, product samples, physical environments — is a significant barrier.

Seeing AI and similar tools reduce this barrier. Several users have reported that the app enabled them to perform job functions that previously required sighted assistance:

  • Retail workers reading price tags and product information
  • Office workers scanning printed documents and reading whiteboards
  • Warehouse workers identifying packages and reading labels
  • Educators reading printed materials during class preparation

Business Insight: Accessibility AI is not charity technology — it is workforce enablement technology. The millions of working-age Americans with significant vision loss represent an underutilized talent pool. Companies that deploy accessibility tools — for employees and customers — access talent and markets that competitors overlook. The Americans with Disabilities Act requires reasonable accommodations; AI-powered tools like Seeing AI make many more accommodations reasonable.


The Business Model: Why Free?

Seeing AI is free. There are no subscription fees, no in-app purchases, no advertisements. Microsoft bears the development, cloud computing, and maintenance costs entirely. Why?

Strategic Reasons

Platform demonstration. Seeing AI showcases the capabilities of Microsoft's Azure AI services — Computer Vision API, Face API, Custom Vision, and Cognitive Services. Every successful Seeing AI interaction is an implicit advertisement for the Azure platform that enterprise customers pay billions of dollars to use.

Talent attraction. AI researchers and engineers want to work on meaningful problems. Seeing AI is one of the most visible examples of AI used for social good, and Microsoft has stated that it is a significant factor in recruiting mission-driven technical talent.

Brand halo. In an era when technology companies face public criticism over privacy, bias, and monopolistic practices, Seeing AI provides an unambiguous positive narrative. Microsoft's communications team regularly features Seeing AI in corporate responsibility messaging, investor presentations, and government relations.

Accessibility ecosystem. Microsoft's broader accessibility strategy — which includes features across Windows, Office, Teams, and Xbox — positions the company as the technology provider most committed to inclusive design. Enterprise customers with accessibility requirements (all US federal agencies, many large corporations) factor this commitment into procurement decisions.

The Broader Market

The assistive technology market — aids for people with disabilities — is estimated at $30-40 billion globally and growing. While Seeing AI itself does not charge users, it creates a halo effect that benefits Microsoft's entire product ecosystem and positions Azure's AI services as the platform of choice for assistive technology developers.

Research Note: The intersection of AI and accessibility is an active area of research and product development. Google's Lookout app provides similar functionality to Seeing AI on Android. Apple has built object recognition directly into its accessibility features. Startups like OrCam have developed wearable devices (glasses-mounted cameras) that provide real-time visual description. The competitive dynamics suggest that the major technology platforms view accessibility AI as both a social responsibility and a strategic asset.


Limitations and Challenges

Seeing AI is impressive but not perfect. Its limitations illuminate broader challenges in computer vision.

Accuracy in Uncontrolled Environments

The app works well in controlled conditions — good lighting, clear text, standard products, familiar faces. Performance degrades in challenging environments:

  • Low light: Camera quality drops, OCR accuracy falls
  • Cluttered scenes: Scene descriptions become vague when many objects are present
  • Non-standard text: Handwriting recognition is significantly less accurate than printed text recognition
  • Unfamiliar products: Products not in the barcode database cannot be identified

These limitations mirror the broader challenge discussed in the chapter: computer vision models are trained on specific data distributions and perform less reliably when the real world deviates from training conditions.

Bias in Description

When Seeing AI describes a person — estimated age, gender presentation, emotional expression — it applies categorical labels that are inherently reductive and potentially biased. The chapter's discussion of Joy Buolamwini and Timnit Gebru's research on facial analysis disparities is directly relevant: if the underlying models have higher error rates for certain demographic groups, the app may describe some people less accurately than others.

Microsoft has partially addressed this concern by removing demographic estimation features from its general-purpose Face API while retaining them in Seeing AI, arguing that the accessibility use case justifies the capability. This decision illustrates the tension between legitimate use cases and systemic bias risks — a tension that has no clean resolution.

Privacy of Described Individuals

When a Seeing AI user points their phone at a stranger and the app describes that person's appearance, estimated age, and emotional expression, the described individual has not consented to this analysis. In most jurisdictions, photographing people in public spaces is legal, but AI-powered facial analysis adds a layer of information extraction that exceeds what a casual glance provides.

This creates an unusual ethical configuration: the user has a legitimate accessibility need for information about the people around them. The described individuals have a legitimate interest in not being subjected to AI facial analysis without consent. The two interests are in tension, and Seeing AI navigates this tension by limiting what information it provides (descriptions, not identification of strangers) and processing images locally when possible.

Language and Cultural Limitations

As of 2025, Seeing AI's full feature set is available primarily in English, with limited support for other languages. Scene descriptions, in particular, reflect the cultural context of the training data — predominantly Western, English-speaking environments. The app may describe scenes less accurately or less relevantly for users in non-Western contexts.


Lessons for Business Leaders

Seeing AI offers several insights that extend well beyond accessibility technology:

1. The most valuable AI applications solve problems that non-technical people experience. Shaikh did not start with "what can computer vision do?" He started with "what do I need that I cannot see?" The best business AI projects follow the same logic: start with the human problem, then determine whether CV (or any AI technology) can address it.

2. Task decomposition improves both usability and reliability. Seeing AI's channel architecture — short text, documents, products, people, currency, scenes, color — breaks a general capability (computer vision) into specific, reliable functions. Business CV systems should follow the same principle: build specific, well-defined tools rather than general-purpose "AI vision" systems.

3. Confidence calibration is a design problem, not just a technical one. How an AI system communicates its uncertainty to users determines whether the system is trusted and useful. This applies equally to accessibility tools, medical diagnostic aids, customer-facing recommendation engines, and executive dashboards.

4. Free can be a strategy. Microsoft gives away Seeing AI because the strategic value — platform demonstration, talent recruitment, brand enhancement, accessibility ecosystem — exceeds the development cost. Business leaders should consider whether some AI capabilities create more value as free offerings (driving platform adoption, generating data, building trust) than as revenue-generating products.

5. Representation in development teams matters. Seeing AI was created by a blind engineer. This was not incidental to the product's quality — it was foundational. Teams that include people with direct experience of the problem they are solving build better products. This principle applies broadly: AI systems for healthcare benefit from clinical input, AI systems for manufacturing benefit from operator input, and AI systems for customers benefit from customer involvement in design.


Discussion Questions

  1. Seeing AI uses face recognition to identify people the user has trained the system on (friends, family, colleagues). The chapter argues that facial recognition in retail settings is ethically problematic. Is the use case in Seeing AI fundamentally different? Why or why not?

  2. Microsoft removed demographic estimation (age, gender) from its general-purpose Face API but retained it in Seeing AI. Evaluate this decision. Is the accessibility justification sufficient? What additional safeguards, if any, would you recommend?

  3. If you were designing a Seeing AI equivalent for the workplace, what additional channels or features would you prioritize? Consider the specific visual information challenges that blind employees face in office, retail, healthcare, and manufacturing environments.

  4. The chapter discusses bias in computer vision systems. How should Seeing AI address the possibility that its person descriptions are less accurate for certain demographic groups? What are the consequences of inaccurate descriptions for a blind user?

  5. Microsoft makes Seeing AI available for free. If a competitor launched a superior paid alternative ($9.99/month), how would the market dynamics play out? Is there a willingness-to-pay for accessibility AI, and if so, among whom — individual users, employers, insurers, or governments?


This case study connects to the cloud vision API discussion in Chapter 15, the multimodal AI capabilities explored in Chapter 18, the bias and fairness analysis in Chapter 25, and the responsible AI frameworks in Chapter 30.