VIRTUAL CAMERA CONTROL SYSTEM FOR CINEMATOGRAPHIC 3D VIDEO RENDERING

Hansung Kim1, Ryuuki Sakamoto1, Itaru Kitahara1,2, Tomoji Toriyama1, and Kiyoshi Kogure1

1 Knowledge Science Lab, ATR, Kyoto, Japan
{hskim, skmt, kogure}@atr.jp
2 Dept. of Intelligent Interaction Technologies, Univ. of Tsukuba, Japan
[email protected]

ABSTRACT

We propose a virtual camera control system that creates attractive videos from 3D models generated with a virtualized reality system. The proposed camera control system helps the user generate final videos from the 3D model by referring to the grammar of film language. Many kinds of camera shots and principal camera actions are stored in the system as expertise, so even non-experts can easily convert the 3D model into attractive movies that look as if they were edited by expert film producers. The user can extend the system by creating a new set of camera shots and storing it in the shots' knowledge database.

Index Terms—Camera control, cinematographic 3D video, virtualized reality

1. INTRODUCTION

There have already been some studies on using video cameras to regenerate video captured at arbitrary viewpoints in a 3D space using the technique of Virtualized Reality [1][2]. The technique reconstructs 3D models in the space by merging multiple videos using computer vision techniques, and generates 3D free-viewpoint videos by applying CG technology to the reconstructed 3D models. We have developed a free-viewpoint rendering system using multiple cameras, as shown in Fig. 1 [3]. The system reconstructs 3D models from captured video streams using a shape-from-silhouette method and generates realistic free-view video of those objects from a virtual camera.

Although we can generate free-viewpoint video with the Virtualized Reality system, one important problem remains: how can we produce attractive videos from the generated 3D models? In film and television production, audiences are attracted by changing camera positions and camera actions in response to each captured situation (hereafter, we refer to these two attributes of a camera as "shots"). In each scene, there are several choices of shots. Interestingly, different shot choices
produce different impressions and effects, even if the captured scene is the same. The "grammar of film language" formalizes these differences and describes the rules of filmmaking for producing easily understandable and attractive footage for audiences [4][5]. Matsushita et al. applied this "grammar" to the 3D CG (computer graphics) world and verified its effectiveness by rendering video productions [6]. However, the grammar has not yet been applied to real events in the real 3D world.

In this paper, we propose a cinematographic virtual camera control system that helps the user to generate final videos from the 3D model by referring to the grammar of film language. Many kinds of camera shots and principal camera actions are stored in the system as expertise, so even non-experts can easily convert the 3D video into attractive movies. Fig. 2 shows a flow diagram for creating a cinematographic video.

2. CINEMATOGRAPHIC CAMERA CONTROL

The grammar of film language is based on constrained conditions for switching sequential camera shots. Since each single shot is nothing more than a video fragment, many shots must be combined to generate an entire video. We call such a sequential combination of shots a 'scene.' In this section, we describe camera shot information and the constrained condition for combining sequential shots into a scene.

2.1. Camera Shot

Camera shot information generally comprises two types of information for camera control: initial camera parameters and camera actions. The initial camera parameters are set to appropriately capture the target objects in the initial state. Specifically, they describe the relative angle between a target object and a capturing camera, in addition to the size and position of the object in the captured image, as shown in Fig. 3. Labeling these values with aliases (e.g. BIRDS EYE,
SUPER LOW, etc.) makes it easier to add new shot information. The camera action describes variations in camera position and the zoom parameter. In the proposed system, these values can be set in two ways: as a time series of relative differences from the initial state, or by interpolating between the initial state and an exit state, which are given as input information. The camera actions are preconfigured and stored in the database. As Table 1 shows, the proposed system provides fourteen camera actions.

Figure 1. Free-viewpoint rendering system

Figure 2. Flow diagram for making a cinematographic video

Figure 3. Initial camera parameters (camera angles ranging from SUPER HIGH and BIRDS EYE down to LOW, SUPER LOW, and POV (point of view); relative angle θ, e.g. θ = 45°, measured from the start point, which equals the target point, to the end point)

Table 1. Taxonomy of camera actions
  Fixed:                     FixShot, BustShot, MediumShot, LongShot
  Moving independently:      CraneUpShot, CraneDownShot, RaiseUpShot, SpinAroundShot, TimeSliceShot
  Moving tied to the target: PanShot, DollyShot
  Zoom:                      ZoomOutShot, ZoomInShot, WhipZoomShot

2.2. Constraints on Switching Shots

The constrained condition for switching camera shots determines the constraints on continuity between the current shot and the next shot. Referring to the grammar of film language, we provide the following two constraints on camera-shot switching to produce easily understandable and attractive videos.
◆ Do not set the next camera shot so that it strides across the imaginary line; crossing it confuses the audience.
◆ Do not choose a following camera shot that is similar to the current shot, because the similarity reduces the effectiveness of the switch.

2.3. Applying Camera Shot

The system generates a scene with a declared set of camera shots that satisfies the constraints on switching shots. If a suitable set of camera shots is found in the preserved film-knowledge database of shots, the user declares the retrieved scene valid. If the user cannot find a suitable set of shots, however, the user can create a new set of camera shots and store it in the shots' knowledge database. The declared or created set of
Figure 4. Block diagram of the pilot system (multiple-video capturing and 3D modeling feed the annotating module, the cinematographic camera-controlling module, and the free-view video generating module, which renders the cinematographic video)
shots is nicknamed for easy future access (e.g. Dramatic, Suspense, etc.). To apply the shots to a scene, the system must know the positions of the target objects/actors and the capturing time codes. We use annotation information to notify the system of these positions and time codes. Annotations are assumed to be made not only by humans but also from various sensor inputs such as IR sensors and pressure sensors. The imaginary line is also estimated from the annotation information.

3. PILOT SYSTEM

In this section, we introduce our pilot system for creating cinematographic 3D video. As Fig. 4 shows, the system consists of a free-view generating module, an annotating module, and a cinematographic camera-controlling module.

3.1. Free-view Generating Module

We have implemented a distributed system using two PCs and eight synchronized IEEE-1394 cameras that provide 1024×768 color video streams at 30 frames/sec. The cameras are oriented toward the center of the space so that they capture almost the same area. An intensity-based background subtraction method is used to segment the foreground regions [7]. The segmentation masks and texture information are sent over a 1-Gbps (gigabits per second) network to the modeling PC. The modeling PC reconstructs the 3D shape of the target object as a voxel volume with the shape-from-silhouette method, then synthesizes a 3D video from a virtual camera. The 3D space is modeled at a resolution of 300×300×200 on a 1cm×1cm×1cm voxel grid, and a microfacet billboarding technique [8] is employed for rendering to generate fine free-view video.

3.2. Annotating Module

When a camera shot determines the capturing parameters of a virtual camera, annotation information is necessary to indicate the target objects. Annotation information consists of both spatial and temporal information.
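As a concrete sketch of such a record (the field and class names below are our own illustrations; the paper does not specify the pilot system's actual data layout), an annotation combining the spatial and temporal parts might look like this:

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class Annotation:
    """One annotation attached to the reconstructed 3D sequence.

    Illustrative sketch only: field names are assumptions, not the
    pilot system's actual record format.
    """
    index: str                         # index information shared by related annotations
    frame: int                         # temporal annotation: time code (frame number)
    position: Tuple[float, float, float]  # spatial annotation: 3D position in cm
    region: Tuple[int, int, int] = (0, 0, 0)  # extent of the 3D region around the position
    user_area: str = ""                # extra free-format "user's area"

# Example: annotate a target labeled "face" at frame 120 of the sequence
face_at_120 = Annotation(index="face", frame=120,
                         position=(150.0, 150.0, 160.0),
                         region=(20, 20, 20))
```

A camera shot can then resolve its target by looking up annotations with the matching index and frame.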
A spatial annotation is described by the 3D positions of the target objects and their 3D regions, while a temporal annotation is described by its time code.

Figure 5. Annotating tool (spatial annotations for the face and the whole body, with the 3D position of each spatial annotation)

Fig. 5 shows a screenshot of the interface application we developed for inputting spatial and temporal annotations simultaneously. A spatial annotation is defined by dragging the mouse over an area, while a temporal annotation is defined by clicking on a point on a timescale bar. These annotations are recorded with index information and an extra "user's area" described in a free format. The annotations are assumed to be input manually, although it is not practical to input all temporal annotations this way because a captured video sequence contains too many frames to be processed by humans. We solve this labor-intensive problem by interpolating between two temporal annotations that have the same index information.

3.3. Camera-Controlling Module

Finally, the system generates footage by piecing together all of the generated videos. Fig. 6 (a) shows example footage of a cinematographic video with camera controls using two annotations of 3D regions and five annotations of time codes. Varied shots with different angles and framing are set for the 3D video to capture a man shadowboxing dynamically. Fig. 6 (b) shows other footage to which the same shots are applied, with a region annotation added to the man's foot. Although these pieces of footage are made from videos of the same scene, the impressions they give are rather different. In the "Dramatic" example, the camera action "CraneDownShot" starts from an initial state with "Long" framing and a "BIRDS EYE" angle, and the camera position then moves gradually closer to the ground. In "Suspense", on the other hand, the "DollyShot" maintains a fixed distance to the annotated target, the foot, using "Close" framing and the "REGULAR" angle. Fig. 7 shows the flow of the camera work in the "Dramatic" footage. These pieces of footage indicate that the user can aim the camera to generate a dramatic video as shown in Fig. 6 (a) and a suspenseful one as in Fig. 6 (b).
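The interpolation step described in Section 3.2 can be sketched as follows; this is a minimal linear-interpolation example under our own naming (the paper does not state the interpolation model, so linearity is an assumption):

```python
def interpolate_position(frame, key_a, key_b):
    """Linearly interpolate a 3D position between two temporal
    annotations (frame_a, pos_a) and (frame_b, pos_b) that share
    the same index information. Sketch only; names are assumed."""
    (fa, pa), (fb, pb) = key_a, key_b
    if fa == fb:
        return pa  # degenerate case: both keyframes at the same time code
    t = (frame - fa) / (fb - fa)  # normalized position between the keyframes
    return tuple(a + t * (b - a) for a, b in zip(pa, pb))

# Two manual annotations of the same target, 30 frames apart:
key_a = (100, (150.0, 150.0, 160.0))
key_b = (130, (180.0, 150.0, 160.0))

# The intermediate frame 115 is filled in automatically:
print(interpolate_position(115, key_a, key_b))  # (165.0, 150.0, 160.0)
```

Only a handful of manually annotated keyframes are then needed per index; every in-between frame gets its position from the interpolation.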
Clearly, then, our system can produce a video in response to the user's request. Video clips showing the results can be downloaded from the following addresses:
http://coolhs99.cafe24.com/Eng/Dramatic.wmv
http://coolhs99.cafe24.com/Eng/Suspense.wmv
Fig. 6. Outcome of the shadowboxing scene: (a) Dramatic; (b) Suspense
Fig. 7. Overview of the camera operations in "Suspense"
4. CONCLUSION

The goal of this study is to develop a virtual camera control system for creating attractive videos from 3D models. The proposed system helps users apply expert knowledge to generate desirable and interesting film footage by using a sequence of shots taken with a virtual camera. As future work, we plan to devise a method that uses sensors to determine annotation information automatically.

ACKNOWLEDGEMENT

This research was supported by the National Institute of Information and Communications Technology.

REFERENCES

[1] P. Rander, P.J. Narayanan, and T. Kanade, "Virtualized reality: constructing time-varying virtual worlds from real world events," Proc. Visualization, pp. 277-283, 1997.
[2] T. Kanade and P.J. Narayanan, "Historical Perspectives on 4D Virtualized Reality," Proc. CVPR, p. 165, 2006.
[3] H. Kim, I. Kitahara, R. Sakamoto, and K. Kogure, "An Immersive Free-Viewpoint Video System Using Multiple Outer/Inner Cameras," Proc. 3DPVT, 2006.
[4] D. Arijon, Grammar of the Film Language, Silman-James Press, 1991.
[5] S.D. Katz, Cinematic Motion: Film Directing: A Workshop for Staging Scenes, Michael Wiese Film Productions, 2004.
[6] A. Inoue, H. Shigeno, K. Okada, and Y. Matsushita, "Introducing Grammar of the Film Language into Automatic Shooting for Face-to-face Meetings," Proc. SAINT, pp. 277-280, 2004.
[7] H. Kim, I. Kitahara, K. Kogure, T. Toriyama, and K. Sohn, "Robust Foreground Segmentation from Color Video Sequences Using Background Subtraction with Multiple Thresholds," Proc. KJPR, pp. 188-193, 2006.
[8] S. Yamazaki, R. Sagawa, H. Kawasaki, K. Ikeuchi, and M. Sakauchi, "Microfacet billboarding," Proc. Eurographics Workshop on Rendering, pp. 175-186, 2002.