
WordsEye: An Automatic Text-to-Scene Conversion System

Research Report by Nicole Starosielski
(created 3/13/07; version 1.0)
[Status: Draft]

Related Categories: Text and Multimedia | Alternative Interfaces | Text visualization


WordsEye is a text-to-scene conversion tool that allows users to construct a computer-modeled scene through simple text. Users describe an environment, objects, actions, and images, and WordsEye parses these written statements and conducts a syntactic and semantic analysis of them. The program assigns depictors to each semantic element and its characteristics and then assembles a three-dimensional scene that approximates the user’s written description. This scene can then be modified and rendered as a static two-dimensional image.

The motivation behind the creation of WordsEye is to reduce the large investment of time that most graphics programs require to create a three-dimensional scene. WordsEye is easy to learn and use and operates through the medium of simple text. Its creators argue that it “should be possible to describe 3D scenes directly through language, without going through the bottleneck of menu-based interfaces” [1]. Like the creation of CaveWriting 2006, the development of WordsEye is driven by the desire to make 3D artistic creation on a computer more accessible and more transparent to the average user.

While using text in the creation of graphics is not new, WordsEye is the first program that offers users a blank slate on which they can generate a picture through words. It accepts a wide range of input, including spatial relationships, textures, orientations and sizes of objects, actions performed by these objects, and light sources.

Figure 1. “Four penguins face Santa Claus. Santa Claus is one foot in front of the log cabin. Santa Claus faces the log cabin. There is a red bag one foot to the left of Santa Claus. The ground is icy. There are trees behind the log cabin.”

Once the user enters the simple text, she has the option of substituting different objects for the ones selected by WordsEye (there are currently twenty-three different models for “house”). Users can also choose a camera position and angle, zoom in and out of the scene, and alter the objects and properties at will. Two-dimensional effects and tools can also be applied (these include “ink,” “watercolor,” “blur,” “brightness,” and “contrast”). When the user is done, she can submit the text for final rendering, and the high-quality picture is saved in an online portfolio.

Figure 2. Options for “house.”

WordsEye often adds details that are not explicitly described in a given scene. For example, if an object is described only minimally (e.g., “The car is big”), WordsEye will occasionally supply a background or a supporting object. Objects are always given a color, even if none is specifically indicated. The creators of WordsEye describe the program’s potential ability to infer specific actions from general descriptions (e.g., “The man goes to the store” may be interpreted as “The man walks to the store”) or supporting instruments from actions (e.g., representing a car in the sentence “Ellen drove to the bank”). Further research is being conducted into ways the environment, including rooms, seasons, and times, can be inferred from the semantic context [2].
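The kind of default-filling described above can be sketched in a few lines. The following Python fragment is purely illustrative — the dictionary of defaults and the function names are hypothetical, not WordsEye’s actual implementation:

```python
# Hypothetical sketch of how a text-to-scene system might supply
# attributes the user never mentions (illustrative, not WordsEye's code).
DEFAULT_COLORS = {"car": "red", "house": "white", "tree": "green"}

def infer_defaults(obj):
    """Fill in unspecified attributes with plausible defaults."""
    if "color" not in obj:
        # every object gets a color even when none is indicated
        obj["color"] = DEFAULT_COLORS.get(obj["type"], "gray")
    if "support" not in obj:
        # objects are assumed to rest on something
        obj["support"] = "ground"
    return obj

scene_obj = infer_defaults({"type": "car", "size": "big"})
```

A description like “The car is big” thus still yields a fully specified object, with color and support filled in by convention.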

The creators also describe a number of techniques that transform abstractions or non-physical properties into depictable entities. For example, WordsEye could generate an image that symbolizes a word: a light bulb would represent “idea.” When the word “think” or “believe” is used, WordsEye might place a thought bubble above the character’s head in the convention of a comic book. When the word “not” is used, WordsEye depicts a large red circle with a slash through it in front of the image being negated. As WordsEye is still being developed, only some of these strategies have been realized. Currently, when there is no other way to depict a word, WordsEye generates a 3D version of the word and situates the text itself in the scene.
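This fallback behavior — try a symbolic icon first, then fall back to rendering the word itself as 3D text — can be sketched as a simple lookup chain. The icon table and function below are assumptions for illustration only:

```python
# Hypothetical fallback chain for depicting abstract words, modeled on
# the strategies described above (symbolic icons, then extruded 3D text).
ICONS = {"idea": "light_bulb", "think": "thought_bubble",
         "believe": "thought_bubble"}

def depict(word):
    """Return a (strategy, asset) pair for an abstract word."""
    if word in ICONS:
        return ("icon", ICONS[word])
    # last resort: render the word itself as 3D extruded text in the scene
    return ("3d_text", word)
```

A word with a conventional symbol (e.g. “idea”) resolves to its icon; anything else falls through to literal 3D text, mirroring WordsEye’s current behavior.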

The present version of WordsEye only creates static scenes. The creators argue that this enables them to focus on semantics without having to deal with problems in the generation of convincing animation. They also acknowledge that the high level of abstraction characterizing linguistic description results in a level of unpredictability. Indeed, while the simple creation of a 3D scene is easy, it is much more difficult to construct a pre-imagined scene with the text. However, WordsEye is not intended to substitute for traditional 3D software tools, but rather to augment them by enabling users to quickly set up scenes which can then be refined. Possible applications for WordsEye creations include use in electronic postcards, visual chat, instant messaging, gaming and virtual reality, education, design, and image search.

Research Context:
WordsEye was initially created by Bob Coyne and Richard Sproat at AT&T Labs. It is currently under development by Coyne and Sproat’s company, Semantic Light. The tool’s development occurs in the broader context of text-to-scene conversion, an interdisciplinary research area that draws knowledge from the fields of 3D Computer Graphics, Computational Linguistics, Cognitive Psychology, and Knowledge Representation [3]. Its predecessors include the Put system, which facilitated graphics creation by allowing users to spatially designate and organize objects in an environment. However, the Put system was limited to pre-existing elements and worked with a small subset of English expressions and a rigid syntax [4].

Since WordsEye’s creation, other text-to-scene conversion programs have been developed. One of these, the Story Picturing Engine, selects the best pictures to illustrate a body of text [5]. Another, CarSim, extends the field to the visualization of road accident reports: it analyzes linguistic descriptions of car accidents and constructs 3D renderings of them [6]. These developments point to the diversity of the field’s applications.

Technical Analysis:
There are several steps in WordsEye’s text-to-scene conversion process. In the first phase, the user’s text is tagged and parsed using a part-of-speech tagger and a statistical parser [7]. The output is a parse tree that represents the sentence’s structure. This tree is then converted into a “dependency representation,” a list of the words that reveals how each word depends on the others (this makes subsequent semantic analysis more convenient). The dependency structure is then converted into a semantic representation: a description of all of the things to be depicted in the scene and their relationships. This conversion employs a variety of semantic interpretation frames, including WordNet, a well-known database of semantic relationships between words (used here in the organization of nouns), and FrameNet (used primarily in the organization of verbs). As of 2001, WordsEye had semantic entries for about 1,300 English nouns, 2,300 verbs, most prepositions, and a few depictable adjectives. The creators of WordsEye note that, unlike recent work in Message Understanding, WordsEye performs a relatively deep semantic analysis that can then be used to answer a wide range of questions.
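The first phase can be caricatured as a pipeline from raw text to a semantic frame. The sketch below is a drastic simplification — naive whitespace tokenization stands in for the tagger and statistical parser, and the frame format is invented, since WordsEye’s internal representations are not public:

```python
# Toy sketch of the analysis pipeline described above: text -> parse ->
# dependency representation -> semantic frame. All structures are
# illustrative stand-ins, not WordsEye's actual data formats.

def analyze(sentence):
    """Map a simple subject-verb-object sentence to a semantic frame."""
    tokens = sentence.rstrip(".").split()  # stand-in for tagging/parsing
    # stand-in "dependency representation": the verb heads the sentence,
    # and the remaining words depend on it
    deps = {"head": tokens[1], "dependents": [tokens[0]] + tokens[2:]}
    # semantic frame: the things to depict and their relationships
    return {"action": deps["head"], "participants": deps["dependents"]}

frame = analyze("John kicks the ball.")
```

Even this toy version shows the shape of the real pipeline: the final frame lists depictable entities and the action relating them, which is what the depiction phase consumes.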

The second phase of the process is the translation of each semantic element into a set of depictors representing 3D objects, poses, spatial relations, color attributes, visibility, size, etc. WordsEye (as of 2001) utilizes approximately 2,000 3D polygonal objects. Since WordsEye is extensible, users can add their own objects to the database. Additional information, including skeletal control structures, color parts, opacity parts, spatial tags, and functional properties, is also associated with each of these 3D models. A set of depiction rules is responsible for finding the most appropriate pose for each action given the context of all semantic elements. First, the depictors are applied, while maintaining their constraints, to build up the scene; then the background environment, the ground plane, and lights are added; lastly, the camera is adjusted [8].
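The ordering of that second phase — depictors first, then environment, then camera — can be sketched as follows. The function and field names are hypothetical, chosen only to mirror the steps described above:

```python
# Illustrative sketch of the depiction phase: each participant in the
# semantic frame becomes a depictor, a pose constraint is recorded for
# the action, and the environment and camera are added last.
# Hypothetical names; not WordsEye's actual API.

def build_scene(frame):
    scene = {"objects": [], "constraints": []}
    # 1. apply depictors for each semantic element
    for participant in frame["participants"]:
        scene["objects"].append({"model": participant, "pose": "default"})
    # pose constraint derived from the action, resolved by depiction rules
    scene["constraints"].append(("pose_for_action", frame["action"]))
    # 2. add background environment, ground plane, and lights
    scene["environment"] = {"ground": "plane", "lights": ["key", "fill"]}
    # 3. adjust the camera last
    scene["camera"] = {"position": "auto", "target": "scene_center"}
    return scene

scene = build_scene({"action": "kicks", "participants": ["John", "ball"]})
```

The design point the sketch preserves is the staging: object placement under constraints happens before environment and lighting, and the camera is framed only once the scene’s contents are fixed.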

Evaluation of Opportunities/Limitations for Transliteracies Topic:

Text-to-scene conversion is significant to Transliteracies as an interdisciplinary field concerned with the analysis of written text and the translation of this text into imagery. WordsEye represents one innovative model which is particularly user-friendly, accessible, and geared towards a creative end. It can be accessed at www.wordseye.com by anyone with an internet connection. There are no software downloads or additional hardware requirements. Its potential range of application falls within many more “traditional” online reading and writing domains, such as instant messaging and electronic postcards.

The program attempts to be as transparent and as close to the interpretation of natural language as possible, as the taglines “watch your language” and “type a picture” testify. However, the ambiguities of language and the limits of semantic interpretation make this more of a goal than a realized process. Creating pictures with WordsEye requires the user to actively discover the codes of the program, what is allowed and what isn’t. The WordsEye workspace and simple interface enable the user to slowly discover what can be visualized and where the edges of both natural language and the computer language reside, encouraging a certain play between language and image in order to flesh out these codes. In this way, WordsEye represents a unique online writing space, as it makes obvious the contours of its own text-to-image translation.

[1] B. Coyne and R. Sproat. “WordsEye: An Automatic Text-to-Scene Conversion System.” In SIGGRAPH 2001, pages 487-496. ACM SIGGRAPH, Addison Wesley, 2001. p. 1.

[2] R. Sproat. Inferring the environment in a text-to-scene conversion system. In Proceedings of the 1st international conference on Knowledge capture, pages 147-154, Victoria, British Columbia, Canada, 2001.

[3] K. Glass and S. Bangay. “Building 3D Virtual Environments from Natural Language Descriptions:Text-to-Scene Conversion.” http://www.cs.ru.ac.za/research/g05g1909/saicsit2005/poster/PosterComments.pdf Last accessed March 2007.

[4] S.R. Clay and J. Wilhelms. “Put: Language-Based Interactive Manipulation of Objects.” IEEE Computer Graphics and Applications, pages 31-39, March 1996.

[5] Joshi et al. “The story picturing engine: finding elite images to illustrate a story using mutual reinforcement.” In Proceedings of the 6th ACM SIGMM international workshop on Multimedia information retrieval, pages 119-126. New York, NY, 2004.

[6] S. Dupuy et al. “Generating a 3D simulation of a car accident from a written description in natural language: the CarSim system.” In Proceedings of the workshop on Temporal and spatial information processing, pages 1-8, 2001.

[7] Part-of-speech tagger: K. Church. “A Stochastic Parts Program and Noun Phrase Parser for Unrestricted Text.” In Proceedings of the Second Conference on Applied Natural Language Processing, pages 136-143. Morristown, NJ, 1988. Association for Computational Linguistics.
Statistical Parser: M. Collins. Head-Driven Statistical Models for Natural Language Parsing. PhD thesis, University of Pennsylvania, Philadelphia, PA, 1999.

[8] Further information on the system can be found in B. Coyne and R. Sproat. “WordsEye: An Automatic Text-to-Scene Conversion System.” In SIGGRAPH 2001, pages 487-496. ACM SIGGRAPH, Addison Wesley, 2001.

Resources for Further Study:

B. Coyne and R. Sproat. “WordsEye: An Automatic Text-to-Scene Conversion System.” In SIGGRAPH 2001, pages 487-496. ACM SIGGRAPH, Addison Wesley, 2001.

R. Sproat. “Inferring the environment in a text-to-scene conversion system.” In Proceedings of the 1st international conference on Knowledge capture, pages 147-154, Victoria, British Columbia, Canada, 2001.

Semantic Light

  nicoles, 04.13.07
