A Real-Time Augmented-Reality System for Sports Broadcast Video Enhancement


Jungong Han, Dirk Farin, Peter H. N. de With

University of Technology Eindhoven, P.O. Box 513, 5600 MB Eindhoven, the Netherlands
(P. H. N. de With is also with LogicaCMG.)

[email protected], [email protected], [email protected]

ABSTRACT

This paper presents a new augmented-reality system designed to generate visual enhancements for TV-broadcast court-net sports. A probabilistic method based on the Expectation-Maximization (EM) procedure is utilized to find the optimal feature points, thereby enabling the automatic acquisition of the camera parameters from the TV image with high accuracy. A virtual camera, derived from the original camera, helps to synthesize a variety of virtual scenes, such as the scene from the viewpoint of a player, depending on the intention of the user. To preserve the visual nature of the original human motion, the player's shape and texture are extracted from the real video and texture-mapped onto the virtual video. The system was tested on a set of court-net sports videos containing tennis, badminton and volleyball, and demonstrated promising results.

Categories and Subject Descriptors
I.4.8 [Image Processing and Computer Vision]: Scene Analysis—Object recognition

General Terms
Algorithms, Design, Experimentation

1. INTRODUCTION

Recently, computer-generated visualization is increasingly used in sports broadcasting to enhance the viewer experience beyond displaying simple data, like time and current score. One example is that virtual objects, such as virtual offside lines in soccer scenes, are overlaid onto the live video. Another demand of the user is interaction with the TV content: the user expects augmented content that helps to better comprehend and enjoy the sports game, e.g., a given scene shown from the viewpoint of a player. Augmented Reality (AR) applied to real sport matches thus helps viewers to engage with and immerse themselves in the action.


Present AR techniques for sports video can be divided into two categories. The research in the first category focuses on generating virtual scenes by means of multiple synchronized video sequences of a given sports game [1, 2]. However, it is difficult, if not impossible, to apply such systems to TV broadcasting, since only single-viewpoint video is available to the viewers at any time. This paper concentrates on systems of the second category [3, 4], which aim at synthesizing virtual sports scenes from broadcast TV video. In [3], the proposed system performs a camera-calibration algorithm to establish a mapping between the soccer field in the image and that of the virtual scene. The posture of the player is selected from three basic choices: stop, walk or run, using the player's motion direction and speed. Finally, computer-graphics techniques (OpenGL) are employed to generate an animated scene from the viewpoint of any player. The work reported in [4] is an improved version of [3], where a more advanced tracking approach for the players and ball is realized. Such systems still suffer from two problems. First, the so-called camera calibration only builds a 2D homography mapping without providing the exact camera parameters. As a result, the virtual scene generated by this technique may differ considerably from the original scene. Second, the graphics-based animation is unfortunately not very realistic, since the texture and natural motion of the player are lost completely. In this work, we aim to address these problems.

This paper proposes a new AR system for broadcast sports video, which contributes in two aspects. First, using a single-view broadcast video, the selection of the feature points in 3D space is improved by employing the Expectation-Maximization (EM) algorithm within a probabilistic framework. Second, as a bonus resulting from the increased accuracy of the projection matrix, we are able to decompose the projection matrix into camera intrinsic and extrinsic parameters, so that new applications become possible. This outperforms estimating the parameters from homography mappings [4], which leads to large errors. Because of the accuracy of the algorithm, our system allows professional applications besides the enhanced viewing experience (free viewpoint). For example, our system can be an efficient replacement for manually controlling the video camera so as to keep an object of interest in the middle of the image. Moreover, our system can act as a camera simulator, producing virtual videos with different camera settings and finding the ideal camera position and setting through simulation.

2. CAMERA CALIBRATION ALGORITHM

The task of camera calibration is to provide a geometric transformation that maps points in real-world coordinates to the image domain. Since the real court-net model is a 3D scene but the displayed image is a 2D plane, this mapping can be written as a 3 × 4 projection matrix M, which transforms a point $p = (x, y, z, 1)^T$ in real-world coordinates to image coordinates $p' = (u, v, 1)^T$ by $p' = Mp$, being equivalent to

$$
\begin{pmatrix} u \\ v \\ 1 \end{pmatrix}
=
\begin{pmatrix}
m_{11} & m_{12} & m_{13} & m_{14} \\
m_{21} & m_{22} & m_{23} & m_{24} \\
m_{31} & m_{32} & m_{33} & m_{34}
\end{pmatrix}
\begin{pmatrix} x \\ y \\ z \\ 1 \end{pmatrix}. \tag{1}
$$
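As a concrete illustration of this step, a standard Direct Linear Transform (DLT) over the six correspondences can be sketched as follows; the NumPy implementation and function name are ours, not taken from the paper.

```python
import numpy as np

def estimate_projection_matrix(world_pts, image_pts):
    """Estimate the 3x4 projection matrix M from >= 6 correspondences
    between 3-D world points (x, y, z) and 2-D image points (u, v)
    using the Direct Linear Transform (DLT)."""
    A = []
    for (x, y, z), (u, v) in zip(world_pts, image_pts):
        X = [x, y, z, 1.0]
        # Each correspondence yields two linear equations in the twelve
        # entries of M (eleven free parameters, since M is up to scale).
        A.append([*X, 0, 0, 0, 0, *(-u * np.asarray(X))])
        A.append([0, 0, 0, 0, *X, *(-v * np.asarray(X))])
    A = np.asarray(A)
    # Solution: right singular vector with the smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    return Vt[-1].reshape(3, 4)
```

In the setting of Fig. 1, `world_pts` would hold the four court-line intersections plus the two net points, with their detected image positions in `image_pts`.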

Since M is scale-invariant, eleven free parameters have to be determined. They can be calculated from six points whose positions are known both in 3D coordinates and in the image. Matrix M can be further decomposed into camera intrinsic and extrinsic parameters, described by $M = K[R \mid -Rt]$, where

$$
K = \begin{pmatrix} f_x & s & u_0 \\ 0 & f_y & v_0 \\ 0 & 0 & 1 \end{pmatrix}, \tag{2}
$$

$$
R = \begin{pmatrix} \mathbf{i}^T \\ \mathbf{j}^T \\ \mathbf{k}^T \end{pmatrix}, \qquad
t = \begin{pmatrix} t_x \\ t_y \\ t_z \end{pmatrix}. \tag{3}
$$

The upper-triangular calibration matrix K encodes the intrinsic parameters of the camera. Parameters $f_x$, $f_y$ represent the focal length, $(u_0, v_0)$ is the principal point and $s$ is a skew parameter. Matrix R is the rotation matrix with $\mathbf{i}$, $\mathbf{j}$ and $\mathbf{k}$ denoting the rotation axes. Vector t is the translation vector. R and t are called the camera extrinsic parameters.
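Given an accurate M, this decomposition is standard multiple-view geometry; a minimal sketch using SciPy's RQ decomposition might read as follows. The helper name is ours, and a robust version would additionally handle the overall sign of M.

```python
import numpy as np
from scipy.linalg import rq

def decompose_projection_matrix(M):
    """Split M = K [R | -R t] into intrinsics K, rotation R and camera
    position t via an RQ decomposition of the left 3x3 block of M."""
    K, R = rq(M[:, :3])
    # Fix signs so the focal lengths on K's diagonal are positive;
    # S @ S = I, so the product K @ R is unchanged.
    S = np.diag(np.sign(np.diag(K)))
    K, R = K @ S, S @ R
    # Fourth column of M equals -K R t, so recover the position t.
    t = -np.linalg.solve(K @ R, M[:, 3])
    K = K / K[2, 2]   # K is only defined up to scale
    # (If det(R) < 0, flip the sign of M and redo; omitted for brevity.)
    return K, R, t
```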

2.1 Related work

Several publications [1, 5, 6, 8] have been devoted to camera calibration for sports video. Liu et al. [5] propose a self-calibration method to extract 3D information from broadcast soccer video. This work is based on Zhang's method [7], where the camera is calibrated from two homography mappings without the need for 3D geometry knowledge of the scene. Zhang's technique assumes that the camera intrinsic parameters are fixed during the calibration process, which does not hold for running broadcast video, where e.g. the focal length changes frequently during capturing (this explains the reported error in [5], which is around 25%). In [6], the authors present a novel method for calibrating tennis video using 6-point correspondences, and design different methods to refine the clip-varying and frame-varying camera parameters. In parallel, we proposed a system [8] for calibrating court-net sports video based on randomly selected points on the net line to indicate the height of the scene. The difference with [6] is that the latter relies on the detection of the top points of the two net posts, which is not robust, as the net posts may not be visible in the image. Our method proved to be more generally applicable (we only need a part of the net) to court-net sports video containing badminton, tennis and volleyball.

In this paper, we adopt the basic concept of [8], which takes at least six points arranged in two perpendicular planes to compute M. The court and net lines characterize these two planes. Additionally, we explore the EM approach to optimally select feature points, instead of selecting them randomly [8], thereby improving the accuracy of the projection matrix M.

Figure 1: The lines and points are selected in the image and the correspondences in the standard model are determined. Six points are used for calibration.

Our aim is to make matrix M accurate enough to enable a decomposition into camera intrinsic and extrinsic parameters to create a virtual camera.

2.2 EM-based selection of feature points

Our algorithm starts with the detection of the court-net lines. Here, an automatic algorithm from [8] is applied, involving several components and techniques, such as white-pixel detection, line detection, court-model fitting, model tracking, net-line detection and net-line refinement. Having obtained the court-net lines, we require six point correspondences from these lines to compute M. In our approach, we select four points from the ground plane and two points from the net plane. In the ground plane, the intersections of the court lines establish four point correspondences (see Fig. 1 for an example).

The extraction of feature points on the net line is more complex. In our previous work [8], we assumed that any vertical projection line onto the ground plane in the 3D domain remains vertical in the image domain. In this way, two arbitrary points $p_5'$ and $p_6'$ on the net line in the 3D model (see Fig. 1) correspond to $T_5$ and $T_6$ in the image. However, since broadcast video is normally captured from a top view, this assumption does not hold in many practical cases. There is a slight slope difference between the 3D projection line and the visible line in the image, which corresponds to the viewing angle of the camera onto the scene; this angle increases with a higher camera position. Although this phenomenon does not change the projection matrix significantly, it has a profound influence on the accuracy of the decomposition of the matrix. Depending on the position of the camera and the distance between points in the image, the projection matrix may not be Euclidean in nature, so that the decomposition into camera parameters gives large errors. Our strategy is to find those feature points on the net line that yield a better Euclidean setting of the projection problem, so that the decomposition of the projection matrix becomes sufficiently accurate.

In this paper, we first adopt our method from [8] to extract two initial net-line points. Many candidate points around these two initial points can easily be found (see Fig. 1). Afterwards, an EM-based method is employed to classify these candidates into two categories: Acceptable Points (AP) and Rejected Points (RP). From the set of APs, we select the best point through maximum-likelihood inference.

Suppose that the computed M can be decomposed into camera parameters as described by Eqn. (2). Ideally, the principal point $(u_0, v_0)$ should be at the center of the image. Due to the presence of noise, it may not be at the exact image center, but at least it should be close to it. In other words, the distance between the computed principal point and the image center can be used to evaluate the quality of matrix M. Based on this distance, a candidate point is classified as an AP or an RP, using the iterative EM procedure. For each point, indexed by k and assuming N candidate points, we have a two-class problem (AP = $w_1$, RP = $w_2$) based on the mentioned distance $d_k$. More specifically, we need to estimate the posterior $p(w_i \mid d_k)$ for each point. By Bayes' rule, this posterior equals [9]

$$
p(w_i \mid d_k) = \frac{p(d_k \mid w_i, \mu_i, \sigma_i)\, p(w_i)}{p(d_k)}. \tag{4}
$$

Here, $p(d_k) = \sum_{i=1}^{2} p(w_i)\, p(d_k \mid w_i, \mu_i, \sigma_i)$, which is represented by a Gaussian Mixture Model (GMM). In addition, $p(d_k \mid w_i, \mu_i, \sigma_i) = \frac{1}{\sqrt{2\pi}\,\sigma_i} \exp\left(-(d_k - \mu_i)^2 / 2\sigma_i^2\right)$. Now, the problem reduces to estimating $p(w_i)$, $\mu_i$ and $\sigma_i$, which can be iteratively estimated using the EM update equations. The quality of the estimates grows with the number of iterations, whose count is denoted by the superscript in the following:

$$
p^{(n+1)}(w_i) = \frac{1}{N} \sum_{k=1}^{N} p^{(n)}(w_i \mid d_k), \tag{5}
$$

$$
\mu_i^{(n+1)} = \frac{\sum_{k=1}^{N} p^{(n)}(w_i \mid d_k)\, d_k}{\sum_{k=1}^{N} p^{(n)}(w_i \mid d_k)}, \tag{6}
$$

$$
\left(\sigma_i^{(n+1)}\right)^2 = \frac{\sum_{k=1}^{N} p^{(n)}(w_i \mid d_k)\,(d_k - \mu_i)^2}{\sum_{k=1}^{N} p^{(n)}(w_i \mid d_k)}, \tag{7}
$$

$$
p^{(n+1)}(w_i \mid d_k) = \frac{p(d_k \mid w_i, \mu_i^{(n)}, \sigma_i^{(n)})\, p^{(n)}(w_i)}{\sum_{i=1}^{2} p(d_k \mid w_i, \mu_i^{(n)}, \sigma_i^{(n)})\, p^{(n)}(w_i)}. \tag{8}
$$

The EM process is initialized by choosing class posterior labels based on the observed distance; the shorter the distance of a point, the greater its initial posterior probability of being an AP, so that

$$
p^{(0)}(w_2 \mid d_k) = \min\left(1.0,\; d_k / \sqrt{c_x^2 + c_y^2}\right), \tag{9}
$$

$$
p^{(0)}(w_1 \mid d_k) = 1 - p^{(0)}(w_2 \mid d_k). \tag{10}
$$

Here, $(c_x, c_y)$ denotes the image center. With this initialization strategy, the process stabilizes fairly quickly (n > 15).
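For concreteness, a compact NumPy sketch of this two-class EM procedure, following Eqns. (4)-(10), could look as follows; the function name, fixed iteration count and variance floor are our own choices, not from the paper.

```python
import numpy as np

def em_classify(d, cx, cy, n_iter=20):
    """Two-class EM (AP vs. RP) over principal-point distances d_k,
    initialized as in Eqns. (9)-(10). Returns p(AP | d_k) per point."""
    d = np.asarray(d, dtype=float)
    # Initialization: a short distance gives a high AP probability.
    post = np.empty((2, len(d)))             # rows: w1 = AP, w2 = RP
    post[1] = np.minimum(1.0, d / np.hypot(cx, cy))
    post[0] = 1.0 - post[1]

    for _ in range(n_iter):
        # M-step: update priors, means and variances (Eqns. 5-7).
        prior = post.mean(axis=1)
        mu = (post * d).sum(axis=1) / post.sum(axis=1)
        var = (post * (d - mu[:, None]) ** 2).sum(axis=1) / post.sum(axis=1)
        var = np.maximum(var, 1e-9)          # guard against collapse
        # E-step: recompute the posteriors (Eqn. 8).
        lik = np.exp(-(d - mu[:, None]) ** 2 / (2 * var[:, None])) \
              / np.sqrt(2 * np.pi * var[:, None])
        post = prior[:, None] * lik
        post /= post.sum(axis=0)
    return post[0]
```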

2.3 Maximum likelihood estimation

Having acquired the APs, the next step is to find the best point among them. The basic idea is to evaluate the distance between a virtual court-net configuration and the detected court-net configuration in the picture. We obtain a virtual court-net configuration by projecting the 3D real-world court-net configuration (derived from the model) onto the video image using $M_k$, where matrix $M_k$ is computed from the k-th point of the set of APs. By varying k and repeating the projection, the configuration with the best match can be identified as the final solution. This is found by minimizing a matching error, defined by

$$
E_k = \sum_{j=1}^{m} \left\| L_j^i,\; L_j^v(M_k) \right\|. \tag{11}
$$

Line $L_j^i$ is the j-th detected line of the court-net configuration formed by m lines in the picture, and $L_j^v(M_k)$ denotes the corresponding line in the virtual configuration. The metric $\|\cdot, \cdot\|$ denotes the distance between two lines. The matrix $M_k$ giving the minimum $E_k$ is selected as the best one.

Figure 2: Detection of the court lines and net line.

Table 1: Court-net detection and camera calibration.

Type         Court    Net      Method        CC (pixel/point, 30 frames)   PP
Badminton    98.7%    96.2%    [8] method    1.6                           (388, -121)
                               our method    1.1                           (372, 256)
Tennis       96.1%    95.8%    [8] method    2.5                           (399, -232)
                               our method    1.3                           (395, 249)
Volleyball   95.4%    91.8%    [8] method    3.2                           (320, 98)
                               our method    2.1                           (340, 220)
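To illustrate the selection step of Eqn. (11): the paper does not spell out the line metric $\|\cdot, \cdot\|$, so the sketch below substitutes a simple endpoint-distance comparison between detected lines and projected model lines. All names and data layouts here are our assumptions.

```python
import numpy as np

def project_line(M, endpoints_3d):
    """Project a 3-D line segment (two endpoints) into the image."""
    pts = np.hstack([endpoints_3d, np.ones((2, 1))]) @ M.T
    return pts[:, :2] / pts[:, 2:3]

def matching_error(M_k, detected_lines, model_lines):
    """Eqn. (11): sum of distances between detected court-net lines and
    model lines projected with candidate matrix M_k. Lines are paired by
    index (correspondences come from the court-model fitting); endpoint
    distance stands in for the paper's unspecified line metric."""
    err = 0.0
    for det, mod in zip(detected_lines, model_lines):
        proj = project_line(M_k, np.asarray(mod, dtype=float))
        err += np.linalg.norm(proj - np.asarray(det, dtype=float))
    return err

def best_candidate(Ms, detected_lines, model_lines):
    """Pick the candidate projection matrix with the minimum E_k."""
    errors = [matching_error(M, detected_lines, model_lines) for M in Ms]
    return Ms[int(np.argmin(errors))]
```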

3. VIRTUAL CAMERA GENERATION

As mentioned before, a virtual view is realized by changing some of the original camera parameters. In our system, the change fully depends on the user's intention. For example, assume that the user wants to keep one player around the vertical midline of the image by rotating the camera. Such a feature is realized by minimizing the following function:

$$
D = \left| \frac{w}{2} - P_x(K, \hat{R}, t, p) \right|, \tag{12}
$$

where $P_x(K, \hat{R}, t, p)$ is the x-coordinate of the projection of p in the image, p is the real-world position of the target player, w is the width of the image, and $\hat{R}$ is the virtual rotation matrix. For applications where more than one parameter has to be changed at the same time, minimizing Eqn. (12) is a complex nonlinear minimization problem, which is solved with the Levenberg-Marquardt algorithm. It requires an initial guess of K, R and t, for which the original camera parameters can be used.

To preserve the visual nature of the original human motion, the player's shape and texture are extracted from the real video and texture-mapped onto the virtual video. To this end, a novel player-segmentation algorithm is applied, which can be found in [8].
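As an illustration of minimizing Eqn. (12), the sketch below restricts the virtual rotation $\hat{R}$ to a single pan angle, so that the problem can be handed directly to a Levenberg-Marquardt solver (SciPy's `least_squares` with `method='lm'`). The paper's system optimizes more general parameter changes; all helper names here are ours.

```python
import numpy as np
from scipy.optimize import least_squares

def project_x(K, R, t, p):
    """x-coordinate of the projection of world point p, M = K [R | -R t]."""
    q = K @ (R @ (np.asarray(p, dtype=float) - t))
    return q[0] / q[2]

def pan_matrix(theta):
    """Rotation about the camera's vertical axis (a pan) by angle theta."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def center_player(K, R, t, player_pos, width):
    """Solve Eqn. (12) with Levenberg-Marquardt: find the pan angle that
    puts the player on the vertical midline of the virtual image."""
    def residual(theta):
        R_hat = pan_matrix(theta[0]) @ R     # virtual rotation matrix
        return [width / 2.0 - project_x(K, R_hat, t, player_pos)]
    sol = least_squares(residual, x0=[0.0], method='lm')
    return pan_matrix(sol.x[0]) @ R
```

In the same spirit, camera height or focal length could be added as extra variables and re-optimized jointly, at the cost of a larger residual vector.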

4. EXPERIMENTAL RESULTS

We tested our algorithms on six video clips extracted from regular television broadcasts. Three of them are tennis games on different court surfaces, two are badminton games, and one is a volleyball game. Fig. 2 shows sample pictures of the court-net detection, where two difficult scenes are selected. Evidently, the method of [6] would fail here, as one net post is not visible. Table 1 shows the evaluation results of our algorithm, which indicate that the detection of the court-net lines is correct for more than 90% of the sequences on average.

Moreover, we compare our algorithm with the method from [8], and both algorithms against ground truth, based on a manual selection of a few intersections in the 3D domain. We transformed those 3D reference points to the image domain using the transformation matrix, and measured the distance between the projected 3D reference intersections and another set of manually selected line-crossing points in the image domain (ground truth). Table 1 shows that for a badminton clip, the average distance obtained by [8] is 1.6 pixel/point, whereas our new algorithm has an error of only 1.1 pixel/point. We also give the coordinates of the camera principal points (PP in Table 1) computed by the two methods; again, our PP is much closer to the image center. Furthermore, we have estimated the heights of two tennis players (Sanchez and Sales) using our algorithm. The estimated height of Sanchez was 164 cm, while her real height is 169 cm; Sales' estimated height is 172 cm, while her real height is 178 cm. It can be concluded that our estimation error is only about 4%, whereas the reported error in [5] is 25%.

Figure 3: Augmented reality generation.

Fig. 3 demonstrates some virtual scenes generated by our system, where three sports games are shown. The first row shows the original images; we generate two different virtual scenes for each game. For the badminton singles match, we translate the camera such that (1) one player is always located at the midline of the captured image, and (2) the viewer watches the game from an arbitrary viewing angle. In the doubles match, we demonstrate the effects of an increased camera height and of a modified focal length. In the tennis match, the clay court is changed into a grass court. One virtual scene is produced by rotating the camera with the motion of a player; another portrays the scene from the viewpoint of a moving player, enabling the viewer to experience the real match. Our virtual scenes proved to be very realistic, because we preserve the shape and natural motion of the player, which is better than animation-based systems [3, 4]. More importantly, our AR-generation system is executed on a P-IV 3 GHz PC, achieving near real-time speed: the average computation for one frame is 30–59 ms, depending on the complexity of the required virtual scene and the image resolution.
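For completeness: height estimates like those above can in principle be derived from M alone. The sketch below is our own reconstruction of one such computation (back-projecting the feet through the ground-plane homography, then solving for the head height); the paper does not specify its exact procedure.

```python
import numpy as np

def estimate_player_height(M, feet_uv, head_uv):
    """Estimate a player's height from a single calibrated view.
    Assumes the court plane is z = 0 and the player stands upright,
    so the head lies on the vertical line through the feet."""
    H = M[:, [0, 1, 3]]                       # ground-plane homography (z = 0)
    g = np.linalg.solve(H, np.array([*feet_uv, 1.0]))
    x, y = g[0] / g[2], g[1] / g[2]           # feet position on the court
    a = M[:, [0, 1, 3]] @ np.array([x, y, 1.0])   # M (x, y, 0, 1)^T
    b = M[:, 2]                               # per-unit-height contribution
    u, v = head_uv
    # The head projects to (a + h*b); this gives two linear equations
    # in the unknown height h, solved here in a least-squares sense.
    A = np.array([[u * b[2] - b[0]], [v * b[2] - b[1]]])
    r = np.array([a[0] - u * a[2], a[1] - v * a[2]])
    h = np.linalg.lstsq(A, r, rcond=None)[0]
    return float(h[0])
```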

5. CONCLUSIONS

We have presented a real-time AR system for broadcast sports video, with two major contributions. First, we have improved the camera-calibration algorithm of [8] in the sense that we optimally choose feature points based on the probabilistic EM technique, leading to a more accurate transformation matrix. This matrix is sufficiently accurate to be decomposed into camera intrinsic and extrinsic parameters. Second, we have built our AR system upon the precise camera parameters computed from a single-view broadcast video. By changing these parameters, it is possible to generate virtual scenes from arbitrary viewing positions, which may be defined by the user. The performance of the system for various court-net sports exceeds 90% accuracy on camera calibration. Moreover, the system can create many realistic virtual scenes.

6. REFERENCES

[1] T. Bebie and H. Bieri. A video-based 3D-reconstruction of soccer games. Eurographics, Vol. 19, pp. 391-400, 2000.
[2] N. Inamoto and H. Saito. Free viewpoint video synthesis and presentation of sporting events for mixed reality entertainment. In Proceedings of ACM ACE, Vol. 74, pp. 42-50, 2004.
[3] K. Matsui, M. Iwase, M. Agata, T. Tanaka and N. Ohnishi. Soccer image sequence computed by a virtual camera. In Proceedings of CVPR, pp. 860-865, 1998.
[4] D. Liang, Y. Liu, Q. Huang, G. Zhu, S. Jiang, Z. Zhang and W. Gao. Video2Cartoon: generating 3D cartoon from broadcast soccer video. In Proceedings of ACM Multimedia, pp. 217-218, 2005.
[5] Y. Liu, D. Liang, Q. Huang and W. Gao. Extracting 3D information from broadcast soccer video. Image and Vision Computing, Vol. 24, pp. 1146-1162, 2006.
[6] X. Yu, X. Yan, T. Chi and L. Cheong. Inserting 3D projected virtual content into broadcast tennis video. In Proceedings of ACM Multimedia, pp. 619-622, 2006.
[7] Z. Zhang. A flexible new technique for camera calibration. IEEE Trans. PAMI, Vol. 22, pp. 1330-1334, 2000.
[8] J. Han, D. Farin and P. de With. Generic 3-D modeling for content analysis of court-net sports sequences. In Proceedings of MMM, pp. 279-288, 2007.
[9] R. Duda, P. Hart and D. Stork. Pattern Classification. Wiley, 2001.
