Technology Stocks : Intel Corporation (INTC)


To: Barry Grossman who wrote (87793), 9/5/1999 4:08:00 AM
From: Barry Grossman
For those always wondering what a high-performance computer will be able to do in the not too distant future, this might give them a taste. It's long and very technical, but there is some good stuff here:

developer.intel.com

Video As Input: an Opportunity Knocking

Video As Input (VAI) is an entirely new category of digital entertainment capability that delivers new computing techniques by addressing every industry level, from base platforms to end-user awareness. The Intel Architecture Lab (IAL) is working with associates that include PC game creators, toy manufacturers, and camera vendors to enable a class of applications that leverage the PC + camera to deliver new value to the end user. At this juncture the full potential of VAI has not been tapped, and a synergistic effort to develop VAI capabilities will enhance the future of consumer VAI.

As part of the development of VAI capabilities, the Digital Entertainment (DE) initiative within IAL is developing segmentation, tracking and recognition technologies that can enable several new forms of interactive, digital play. Let's discuss how this emerging technology is being used today and the impact it can have in the future.

Introduction to VAI
The traditional trends of computing are output focused; input devices, by contrast, have barely been investigated. The keyboard, developed many decades ago, is still the primary means of input today, and the arrival of the mouse, joystick, and touch screen added very little to the computer's awareness of its physical surroundings. In the last few years, however, microphones and video cameras have begun to ship with new computers, enabling a fundamental change in how computers can discern their local environment. Because humans perceive the world primarily by seeing and hearing, a shift away from "touch"-based input devices is needed; today's PC, however, still relies almost entirely on touch-like input devices.

Currently, most video technologies for the PC reside in codecs, conferencing, television, and media display. The amount of intelligent, semantic-based processing applied to the video stream is insignificant. The PC takes video in, changes representation (color conversion, compression), and displays it on the screen.
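As a concrete illustration of this kind of pixel-scale representation change, a color conversion reduces each RGB pixel to a single luminance value. The Python sketch below is purely illustrative (the article does not prescribe any implementation); it uses the standard ITU-R BT.601 luma weights common in video color conversion:

```python
def rgb_to_luma(r, g, b):
    """Convert one RGB pixel (0-255 per channel) to a luma value
    using the ITU-R BT.601 weights: Y = 0.299R + 0.587G + 0.114B."""
    return round(0.299 * r + 0.587 * g + 0.114 * b)

print(rgb_to_luma(255, 255, 255))   # white -> 255
print(rgb_to_luma(0, 0, 255))       # pure blue -> 29 (blue contributes least)
```

Conversions like this, applied independently per pixel with no knowledge of what the pixels depict, are exactly the "pixel-scale" processing the PC performs today.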

Video as input is about exploring the use of higher levels of cognitive examination of the video stream. It is about treating the camera (the eye) plus the PC (the brain) as an intelligent input device, rather than a simple streaming source. This input can allow the PC to react to environmental changes.

Some examples of how VAI applications can be used are:

Proximal Sensing/Passive Input: passive monitoring with sophisticated event triggering based on recognition; this technology could be used in security, home-office automation, etc.

Augmented Reality: capture video input, augment it, and then display it back to the user; this could be used in games or immersive programming where users insert themselves into virtual locales (such as a ski slope or music video) and interact with objects in virtual space.

Human-Computer Interface: integrates video into the primary interface to the PC and appropriate applications. A set of visual UI control primitives like the buttons and sliders in Windows could be used as touch screens with standard monitors, along with "graspable" UI and gesture-driven command and control for selected applications.

"Virtual Reality" based on video input: synthesize output (3D models and photo-realistic video) based on video (and other semantic) input, where emotions detected in speech would drive "avatars" or 3D models of virtual worlds, or gesture-driven command-and-control gaming applications.

Desktop cameras are becoming ubiquitous, thanks to video conferencing and imaging applications. It is now easy to add cameras because of the Universal Serial Bus (USB), and applications such as the Intel® Create & Share™ Camera Pak and Microsoft* NetMeeting are readily available on the market today. Trends in hardware infrastructure (higher bandwidth) are evolving daily, and with a widespread base of video cameras, a higher level of combined functionality can be achieved.

Year 2000 Technology Feasibility
The feasibility of utilizing VAI technology in the year 2000 is very favorable. The segmentation, motion-detection, object-tracking, and certain identification technologies in VAI are at prime-time readiness for non-mission-critical applications. Let's go into greater detail to understand how these may be used in 2000 and the possibilities for the future.

Figure 1 depicts the algorithm maturity and MIPS usage of each type of VAI functionality. It is important to realize that greater algorithm maturity increases a VAI application's robustness to environmental changes and its accuracy, while greater MIPS usage means a more powerful platform is required to run the application. As we discuss each technology, refer back to this chart to compare algorithm maturity and MIPS usage.

Figure 1. Year 2000 Technology Feasibility

This paper focuses only on the year 2000 sweet spot, that is, the basic "Pixel-Scale" and "Segmentation & Tracking" technologies depicted in Figure 1.

Automatic Video Segmentation
Automatic video segmentation refers to the act of identifying the portions of a video frame that belong to an object, without any user intervention or special scene preparation (such as the blue-screen chroma-keying used in professional productions). This task is probably the most fundamental vision task in the sense that it allows applications to interpret the video input as a collection of objects rather than a matrix of pixels. The most common use of segmentation is in applications where moving foreground objects are differentiated from a stationary background using motion as a cue; for this reason, segmentation is often also called foreground/background segmentation.

From a usage-model viewpoint, segmentation algorithms can be classified into two main categories. The first comprises algorithms that require a stationary background to be available before segmentation commences; the second comprises techniques that can learn the background in the course of segmentation. In general, the second category of algorithms is less mature and consumes greater computational resources. Consequently, applications that can afford the start-up cost of calibration generally use the first category of segmentation algorithms (see Figure 1).
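The first category can be sketched in a few lines: capture a background frame during calibration, then mark as foreground any pixel that differs from it by more than a threshold. This toy Python version (frames as 2D lists of grayscale values, a fixed threshold; all names and values are invented for illustration, and shipping products use far more robust statistics) shows the idea:

```python
def segment_foreground(frame, background, threshold=30):
    """Return a binary mask: 1 = foreground, 0 = background.

    frame, background: 2D lists of grayscale values (0-255), same shape.
    A pixel is foreground if it differs from the calibrated background
    by more than `threshold`.
    """
    mask = []
    for row_f, row_b in zip(frame, background):
        mask.append([1 if abs(f - b) > threshold else 0
                     for f, b in zip(row_f, row_b)])
    return mask

# Example: a 3x3 "scene" where one pixel brightened sharply after calibration.
background = [[10, 10, 10], [10, 10, 10], [10, 10, 10]]
frame      = [[10, 10, 10], [10, 200, 10], [10, 10, 10]]
print(segment_foreground(frame, background))
# -> [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
```

The start-up cost mentioned above is the capture of `background` while the scene is empty; second-category algorithms avoid that step by estimating the background statistically over time.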

Segmentation has many applications in a variety of video processing techniques, including compression. Of particular interest to VAI is the use of segmentation to provide an immersive experience for users. In such applications, segmentation is used to extract the subject from his or her surroundings and render the subject into a virtual locale. This capability leads to many immersive gaming and creativity applications, such as the one from Sabbatical, inc.* illustrated in Figure 2.

Figure 2. Sabbatical, inc. Example

Several segmentation solutions have already entered the realm of real-time operation today. RealityFusion, ePlanet, and Sabbatical, inc. have Software Development Kits (SDKs) and/or applications that demonstrate the range of VAI capabilities that segmentation can enable. In general, these implementations of segmentation can provide 20+ fps performance with 160x120 video on Intel® Pentium® II processor class machines. Of course, higher-performance CPUs allow these algorithms to achieve higher frame rates or to segment larger video frames at the same frame rate.

Motion Detection
Motion detection refers to the task of identifying temporal change in a video stream associated with object motion in specified areas of the video frame. This capability can be implemented with a range of features, from simple motion detection algorithms (providing a yes/no classification) to more capable algorithms able to provide information about the direction and velocity of object motion. Motion detection is frequently used to implement hot spots for interactive VAI applications. When combined with segmentation, motion detection can be used to provide interactive experiences in immersive (virtual) spaces. For example, a user may be immersed in a virtual room where he or she interacts with virtual objects (implemented as hot spots) using motion detection.

The simple motion detection algorithms use various forms of pixel differences; such algorithms can easily be performed in real time with modest computing resources (Intel® Pentium® processor class PCs). The more complicated motion detection algorithms that provide speed and direction information use motion estimation or optical flow techniques and are considerably more expensive to implement. Depending on the number of hot spots defined in the video, such algorithms may easily require the power of an Intel® Pentium® III processor class machine in order to achieve an acceptable frame rate (see Figure 1).
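A minimal sketch of the simple, yes/no class of motion detector described above: count the pixels inside a rectangular hot spot that changed between consecutive frames. The hot-spot layout, thresholds, and names are all invented for this Python illustration:

```python
def motion_in_hotspot(prev, curr, hotspot, threshold=25, min_changed=2):
    """Return True if enough pixels inside `hotspot` changed between frames.

    prev, curr: 2D lists of grayscale values (consecutive frames, same shape).
    hotspot: (row0, col0, row1, col1), half-open bounds of the region to watch.
    """
    r0, c0, r1, c1 = hotspot
    changed = 0
    for r in range(r0, r1):
        for c in range(c0, c1):
            if abs(curr[r][c] - prev[r][c]) > threshold:
                changed += 1
    return changed >= min_changed

prev = [[0] * 4 for _ in range(4)]
curr = [[0] * 4 for _ in range(4)]
curr[1][1] = curr[1][2] = 255          # user "waves" inside the hot spot
print(motion_in_hotspot(prev, curr, (0, 0, 3, 3)))   # True
```

Each hot spot is just another rectangle passed to the same function, which is why the cost of the simple detectors grows only mildly with the number of hot spots, while optical-flow methods pay a much higher per-region price for their direction and speed estimates.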

Object-Tracking and Identification
Segmentation provides VAI applications with the basic capability to address video data in terms of "objects" instead of "pixels." An interesting class of VAI applications results from being able to detect and to react to the semantic behavior of such objects. Two such semantic tasks that are feasible today are tracking and recognition of human forms.

Tracking, in the context of human form tracking, usually refers to identifying and following interesting points of a human body such as the head, hands, elbows, torso, and leg extremities. A good example of tracking technology is the head-tracker capability built on the Continuously Adaptive Mean Shift (CAMSHIFT) algorithm. 2D tracking (which involves tracking points in the 2D image plane of a single camera) provides the ability to drive synthetic models, facilitating applications such as avatar-based communications. Such a capability can also be used to implement full- or partial-body gesture recognition algorithms by associating the temporal behavior of the chosen tracking points with distinct gestures. 3D tracking (where each tracked point is located in a 3D space centered on the camera) usually requires stereo cameras in order to provide depth information. While 3D tracking is more accurate and enables a wider range of application capabilities, it probably will not be deployed in volume in the year 2000, due to its dependence on stereo cameras.
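The core of a CAMSHIFT-style head tracker is a mean-shift iteration: repeatedly move a search window to the centroid of a per-pixel weight map (in the real algorithm, a color-histogram back-projection of the tracked face). The toy Python sketch below, with an invented 6x6 weight map and fixed window size, shows only that iteration, not the full adaptive-window algorithm:

```python
def mean_shift(weights, center, half=1, iters=10):
    """Shift `center` (row, col) toward the local centroid of `weights`.

    weights: 2D list of non-negative per-pixel scores (e.g., a color
    back-projection); half: half-width of the square search window.
    """
    rows, cols = len(weights), len(weights[0])
    for _ in range(iters):
        r0, r1 = max(0, center[0] - half), min(rows, center[0] + half + 1)
        c0, c1 = max(0, center[1] - half), min(cols, center[1] + half + 1)
        total = wr = wc = 0.0
        for r in range(r0, r1):
            for c in range(c0, c1):
                w = weights[r][c]
                total += w
                wr += w * r
                wc += w * c
        if total == 0:            # nothing in the window: give up
            break
        new = (round(wr / total), round(wc / total))
        if new == center:         # converged on the local peak
            break
        center = new
    return center

# A bright "face" blob near (3, 3); start the window at (1, 1) and converge.
weights = [[0.0] * 6 for _ in range(6)]
weights[3][3] = 10.0
weights[2][2] = weights[2][3] = weights[3][2] = 1.0
print(mean_shift(weights, (1, 1)))   # -> (3, 3)
```

CAMSHIFT adds the "continuously adaptive" part on top of this: after convergence it resizes and reorients the window from the weight distribution, which is what lets it follow a head that grows or shrinks as the user moves toward or away from the camera.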

Recognition is a generic task, probably most commonly used in the context of face recognition and gesture recognition (discussed above). Face recognition is an interesting and powerful technology that enables a wide variety of security (a "passface" instead of a password) and personalization applications. A good example of commercially viable face recognition technology can be found in the FaceIt* product from Visionics. This technology can be effective on today's PCs and promises to scale with available processing power (larger facial databases, shorter recognition times, and recognition in the presence of incomplete or occluded face data) in 2000.

Break Barriers in VAI Architecture
The technical feasibility of VAI is very real, and although the technology is in its infancy, it has sufficient traction to create impact in the 2000 time frame. While providing exciting technologies, developers need to be careful not to create barrier-ridden architectures; even now, they can take the right steps to break down widespread barriers. Two types of architectures exist today, and the following descriptions make apparent the synergistic effort required to break down these barriers.

Monolithic Architectures
Monolithic architectures produce closed applications. This is prone to happen when one developer builds the entire application stack. The result is proprietary vision algorithms that can only be accessed via custom APIs, and custom APIs are very difficult to propagate to the end user. It also results in vision algorithms tuned for specific cameras that the general population does not own or use. Another consequence of monolithic architectures is that only custom camera APIs can be used for controlling capture parameters, which is associated with the use of dated video processing infrastructures such as Microsoft's* Video for Windows*. The proliferation of closed applications creates barriers to the widespread deployment of VAI applications in the future.

Modular Architectures
In contrast, modular architectures create open applications that partition capture, vision, and display processing. The importance of partitioned tasks is that they interact via known or query-compliant interfaces, such as a video capture driver compliant with the Microsoft Windows Driver Model* (WDM). In this model, the entire processing chain can be implemented via Microsoft's DirectShow* architecture, where vision components are filters with COM-like, query-supported interfaces.

The modular architecture advantages are that they:

Leverage known interfaces for capture technology, so the application can work with a wider base of cameras. Custom interfaces are no longer a barrier to using a camera.

Allow extensible use of vision technologies, requiring minimal re-engineering to upgrade vision processing modules.

Motivate development of robust, camera-independent vision processing algorithms.

Allow common functions to be re-used, spurring faster and more efficient development of a "product line." As a result, new applications do not have to be written from scratch.

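The partitioning described above can be sketched as a chain of "filter" objects that interact only through a small, queryable interface, so any stage can be swapped without re-engineering the rest. This Python stub merely illustrates the pattern; the architectures the article discusses used DirectShow filters with COM interfaces, and all class names here are invented:

```python
class Filter:
    """Base "filter" exposing a query-compliant capability interface."""
    capabilities = ()

    def supports(self, cap):
        return cap in self.capabilities

    def process(self, frame):
        raise NotImplementedError

class CaptureStub(Filter):
    capabilities = ("capture",)

    def process(self, frame=None):
        return [[0, 0], [0, 255]]   # stand-in for a captured grayscale frame

class ThresholdFilter(Filter):
    capabilities = ("vision",)

    def process(self, frame):
        return [[1 if p > 127 else 0 for p in row] for row in frame]

class DisplayStub(Filter):
    capabilities = ("display",)

    def process(self, frame):
        return "\n".join("".join(str(p) for p in row) for row in frame)

# Build the chain; because every stage shares one interface, any of them can
# be replaced (e.g., a different capture driver) without touching the others.
pipeline = [CaptureStub(), ThresholdFilter(), DisplayStub()]
frame = None
for stage in pipeline:
    frame = stage.process(frame)
print(frame)
# 00
# 01
```

The `supports()` query stands in for the interface negotiation (COM `QueryInterface`, DirectShow media-type negotiation) that lets a modular application discover at run time whether a given capture device or vision filter can serve it.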
Camera Requirements
Camera interfaces as described above are important, but they are not the only issue; video quality is also an important consideration. Video quality requirements vary depending on the application's tasks, and not all cameras are created equal from a VAI perspective. In a monolithic architecture, cameras are built to serve the vision technology within a specific application; in other words, the camera's requirements meet task-based solutions, resulting in custom interfaces. From the consumer's viewpoint, a new camera would have to be purchased with every new application for optimal results. In order to produce VAI-capable cameras, an understanding of vision processing is essential. This understanding will produce the necessary interface hooks in the camera and solve the "one application, one camera" problem for the future.

Video quality requirements vary depending on the task and its application. Some of the camera attributes of interest are:

Video Resolution: higher resolution provides more detail for object feature detection.

Minimum resolution for 2-feet applications is 160x120.

Minimum resolution for 10-feet applications is 320x240.

Color: color provides essential information to many recognition tasks, e.g., face or hand detection.

Color space is less important than color fidelity.

Spatial sub-sampling in the sensor and interpolation must be synchronized.

Field of View: must accommodate 75% of an adult human body at 10 feet. Geometric distortions in the lens can affect domain-knowledge-based vision tasks.

Sensor and Capture Noise: noise is a signal to many vision applications (motion detection, feature extraction).

An SNR greater than 40 dB is recommended for today's technologies.

Temporal "flickering" is an important complicating factor.

Low-Light Sensitivity: important for "home" applications. An SNR greater than 40 dB at 3 lux illumination under room lighting (3000 K).

Frame Rate and Driver Latency: very important for interactive applications, e.g., gesture-driven interfaces.

20 fps minimum; 30+ fps is recommended.

Latency must be less than 30 milliseconds.

Frame-rate "jitter" must be less than 5 milliseconds.

Camera Interface Requirements: programmatic interfaces must allow applications to control camera parameters such as:

Brightness

Contrast

Hue

Saturation

White Balance

Auto Focus

Exposure

Frame-rate

Color-format
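The quality guidelines above can be collected into a simple conformance check. The following hypothetical Python helper (the field names, dictionary structure, and helper itself are invented for this sketch; no such tool appears in the article) tests a camera's published specs against the 10-feet-application numbers listed:

```python
# VAI guidelines from the text: 320x240 for 10-feet applications,
# SNR > 40 dB, 20 fps minimum, latency < 30 ms, jitter < 5 ms.
VAI_REQUIREMENTS = {
    "min_width": 320, "min_height": 240,
    "min_snr_db": 40,
    "min_fps": 20,
    "max_latency_ms": 30,
    "max_jitter_ms": 5,
}

def vai_ready(spec, req=VAI_REQUIREMENTS):
    """Return a list of failed requirements (empty list = VAI-friendly)."""
    failures = []
    if spec["width"] < req["min_width"] or spec["height"] < req["min_height"]:
        failures.append("resolution")
    if spec["snr_db"] <= req["min_snr_db"]:        # must be strictly > 40 dB
        failures.append("snr")
    if spec["fps"] < req["min_fps"]:
        failures.append("frame-rate")
    if spec["latency_ms"] >= req["max_latency_ms"]:
        failures.append("latency")
    if spec["jitter_ms"] >= req["max_jitter_ms"]:
        failures.append("jitter")
    return failures

camera = {"width": 320, "height": 240, "snr_db": 42,
          "fps": 30, "latency_ms": 25, "jitter_ms": 3}
print(vai_ready(camera))   # [] -> meets the 10-feet guidelines
```

A camera vendor publishing specs in a machine-readable form like this would let developers pick the "right" camera for a VAI application automatically, which is precisely the call to action in the next section.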

Stakeholders' Call to Action
Vision Technologists & Solution Providers
Creating the consumer PC cameras of the future can be accomplished by defining objective, possibly task-based measures of video performance to help camera vendors develop "VAI-friendly," multi-function video.

Also, concentrating on developing robust algorithms capable of working with different cameras and in less constrained environments, i.e., transitioning from task-based solutions to SDKs, would create VAI-friendly applications.

This also includes developing scalable algorithms that not only demonstrate the capabilities of tomorrow's high-end processors, but also continue to deliver end-user value across the spectrum of computing power available today. Algorithms written for the highest-end processor available that also scale down to the lowest expand the target audience.

Camera Vendors
Work with vision algorithm experts to understand and create objective video performance requirements for VAI. When advertising camera performance to application developers, publicize VAI specifications so developers can build applications according to the specification and choose the "right" camera for each VAI application.

Abandon custom interfaces for camera drivers. Use published, standardized interfaces that enable standardized applications, and work with interface vendors to extend those interfaces (e.g., provide WDM-compliant drivers under DirectShow*).

Understand vision-processing requirements and provide all necessary interface hooks in the camera driver. One example would be to provide the ability to turn off the automatic white balancing feature.

Conclusion
Using VAI to create new ways of visual computing has benefits for all who take the opportunity. VAI enables a class of applications that leverage the "PC + camera" to deliver new value to the end user. Although VAI is an emerging technology area in its infancy, it has sufficient traction to create impact in the year 2000. VAI is an opportunity knocking, and tapping its full potential will require an industry-wide synergistic effort from all those who open the door to this beneficial challenge.

For More Information
The DE initiative is providing this list of vision technology solution providers to give stakeholders and camera vendors proof that solutions actually exist today and are growing quickly going into 2000. The list also provides an opportunity to glance at the next level of tool and technology solutions: take a look at what these companies are working on and the breadth of application and usage models available.

Vision Technology Solution Providers:

ePlanet

RealityFusion

Microsoft

Vivid Group

Visionics

Sabbatical, inc.

Intel



To: Barry Grossman who wrote (87793), 9/5/1999 8:50:00 AM
From: Fred Fahmy
Barry,

Thanks for the Craig B. article regarding the internet and E-commerce. That one is definitely a "keeper". It amazes me how so many people still don't fully appreciate how significant the internet is and more importantly how significant it will be going forward. It's very reassuring to know that Intel's leaders are not in this group of people <gg>. IMO, the internet is on par with or possibly even more significant than the "industrial revolution".

BTW, does anyone else here think that SI should add "internet" and "E-commerce" to their spell-check dictionary?? It is amazing that on this site in particular, "internet" would not be a recognized word.

Thanks again,

FF



To: Barry Grossman who wrote (87793), 9/5/1999 2:58:00 PM
From: Brian Malloy
I would agree, the world is changing in many aspects.

One of the neat things is that as long as you have access to a computer, such as at your local library or some type of terminal, you can have a presence on the net.

Free email, calendars, and web pages through Yahoo/GeoCities and others, plus sites that offer free storage on the web.
Message 11163796

Perhaps Intel will provide some server farm disk space to its DRIP shareholders. Say, one meg for every ten shares. John Hull, are you listening?