Data compression producing structures—Discoveries

May 25, 2017 | Autor: Gerard Guiho | Categoria: Information Systems, Data Compression, Library and Information Studies

Descrição do Produto

DATA COMPRESSION PRODUCING STRUCTURES-DISCOVERIES G. GUIHO Institut Specialis&d’hrformatique, Universite Paris&d, Centre d’orsay, Batiment 336-91405Orsay, France (Received 24 June 1975)

Abstract-This paper describes a formalism to construct some kinds of algorithms useful to represent one structure about a set of data. It proves that if we do not take into account cost considerations of one algorithm, one can partialy replace the memory by an algorithm. It also proves that the remaining memory part is independant of the construction process. It then evaluate the affects of algorithms representation cost and gives the resulting memory gain obtained in two particular examples.

1. INTRODUCTION Let

X and N be some finite sets and XC N, the following question will be answered (Q) is a given x in X or not?

We shah actually study the amount of memory required to represent X when N is known. There are two different aspects. (1) One can assume there is no particular structure on the set X and study the average number of access to retrieve one information in a memory versus the quantity of memory required to represent X. (2) One can try to find out one structure on the set X and study how this knowledge can reduce the amount of memory which is necessary to answer the question Q. This second aspect will be studied in this paper. 2. DEFINITION

OF OPERATORS

Following a formalism analogous to [i], an operator 0 will be defined as a finite set of predicates {P,(x)},,,, defined on a non-null domain D. Every predicate P, is associated with an injective function ftfrt + II), which will be refered to here as an associated reduction. The set of predicates has the following properties: --It is complete

--It is mutually exclusive Vx(-P(x)u

-Pi(x))

Vi,ji#

j,

These conditions express the fact that only one predicate takes the value “True” for a given x. Let X and N be two finite subsets of D, with XCN

and xi and Ni the image byfi of the subsetsof X and N where P,(x) is true. We will write

xi =f,(X) N= NV) (not to be confused with the preceding /i). 133

G. GUIHO

134

Let M(X, N) be an application of 9(D)

x 9’(D)

in R’, with:

XCN ~(0, N) = 0 VN,M( N, N) = 0.

We suppose the set of associated reductions to be such that M(X, N) 2

z:M(xi,

I%).

A set of operators {O,),,K will be sutisfactory if VX, N, Xf

0,X+ N, X C N IOk : M(X, N) > 2

M(&

Ni).

iEI

The rationale behind these definitions is the following: D will be a set of elements which for example represent the integers in a computer. If the operator is applied to a value x and if P, is the predicate which is true for x, the associated reduction will effectively translate the information coming from the knowledge that P, is true. The function M(X, N) can similarly be interpeted as an “amount of representation” of X, knowing N, i.e. as the amount of memory, required to represent X when N is known. We can graphically represent an operator in the following way:

Its ~~~~~sufjo~on X and N will similarly be defined as:

where si is the output of (Pi>ft). 3. COMPOSITION

OF OPERATORS

and O2 be two operators respectively defined by [{Pi(X)}ic,y{A(x)),td and ((I~;(x),EJ~ {f;(.~)]~~,),and 1 E f. By definition the operator 6%= @,‘O, is associated to ({Y,(x)}~~R, {%(x)~~~IR) Let

0,

with:

(9

{4r(X)}rtH = {p,(x)],., u ~P:(fr(,~))AP,(x)},,~J

(ii)

{g,(x)}&? = {J(X,}it, u{f:(fi(x))hEJ

(iii)

R=(f-I)Ef.f

where

/.l={u/u =/j,iEJ).

Data compression producing structures-discoveries

135

One can easily sheck that 0’ is an operator and that its graphical representation is the following.

This composition low will be referred to as the %‘,low and independently of 1: % 4. GAIN OF AN OPERATOR

4.1 Definition We define a gain of “representation” brought by an operator 0 on a subset X CN as g B.X,W= M(X, N) - 1 M(X, N), by definition, gH.X,N>O. iE1 4.2 Property 1 The gain is additive with respect to %,

= M(X, N) -

2 M(X, Nii) + Mwl, N,) -z WJG,, NJ I

i

= ge,.x.N+ge*.x,,cY,.

For given X and N let %(X, N) be the restriction of %‘,to the subset of operators with strictly positive gains (gB2.XI.NI > 0). Let [X] be the cardinal of X. Henceforth, we will assume that M(X, N) is a function of [X] and [N] only.

Let {O,},,T be a finite number of operators which we shall refer to as the starting operators. 4.3 Property 2 The number of operators finite.

that can be constructed from the starting operators

by %(X, N) is

We shall show that no infinite construction is possible. At the sth step of the construction let:

where nx is the number of sets Xw corresponding to an output 6.1on which one can apply a new operator (i.e. 30, : g8P.XW,NW > 0). Let us show that m, is a strictly decreasing series. At each step, there exist several possibilites: (1) Let us assume [X,] 3 2 then: (i) OS+, splits X into xl,, xfz. . . . where at least two of these subsets are distinct from 0. Then nx can either: -decrease by 1 if no other operator can be applied on X,, X, . . . . . In this case, Z [XW] w decreases at least by 2, and then n,Sdecreases at least by 1.

136

G.

GUIHI~

-remain unchanged. In this case, it is possible to apply a new operator on only one of the X,,, and then C [X,] decreases at least by I again. w -increase and then m., strictly decreases. (ii) 0, +, yields only one X,, f 0 then [X,] = [X,,] an [N,,] necessarily decreases because if it did not, one would have

M(X,, N/J = M(X, N! ). then g w, ,_~,,h,= 0 contradicting l

the assumption

made on %(X, N).

(2) [x/l = 1 a,+, can only split X, into one subset distinct from 0 which is exactly case (ii). This ends the proof of property 2. An operator will be muximal on X and N if it cannot be composed with another by %(X, N). 4.4 Property 3 All the muximul operators huve the sume gain. Let ei and 0; be two maximal operators, and {.G}~ (Rep. {s,}I ) the subset of outputs of 0, (resp. Oi). Let 0 = 0,‘~0,‘0,”

0: 0,

be an operator constructed by the application of Oi on all the outputs of 0, (using % since %(X, N) cannot be used in this case). Since 0, is maximal go.X.N = g,,.,Y.N. Because of the composition low of the predicates in %’ we have

which can be visualised

Similarly,

Then g,,,,,,

in the following way

since Oi is maximal

= g,,,,,,V which concludes

4.5 Property 4 If the set of sturting operators

the proof.

is sutisfuctory then for every muximul operator RO.X..N=

M(X, N).

We first show that g,,,,, cannot have any output si such that X = 0 and Xi -Z N, strictly: if it did, there would exist 0’ with gH..,V,.,.N,. > 0 since the set of starting operators is satisfactory, and then 0 would not be maximal.

Data compression producing structures-discoveries

137

Consequently, every ouptut is such that X, = 0 or Xi = Ni and then M(X, Ni) = 0 which concludes the proof. To sum up, we have shown that whatever the constructing process of operators such the corresponding gain is strictly increasing: (i) A maximal operator is reached after a finite number of operations. (ii) If the set of starting operators is satisfactory, the gain finally reached is equal to the initial “amount of representation” (or memory). 5. COST

OF AN OPERATOR

We have shown so far that one can totally replace a memory by an operator (i.e. an algorithm) this replacement is not actually worthwhile, because the memory required to represent the operator itself must be taken into account. In order to do so, we shall associate a cost C(0) to every operator 0: C(0) is strictly positive and additive with respect to: Vl,C(O,‘O*) = C(O,) + C(O2). One can then define a new composition law that will allow operators composition only if the “real gain” that is obtained by this composition is positive; here the “real gain” will be defined as: RR.X,.N,-

CC@).

As before, all constructing processes using this law will be finite. The maximal operators of this new law will be called valid operafors. In contrast with the maximal operators, valid operators do not have the same real gain. Finding an operator-constructing method that would yield a minimum real-gain valid operator can be treated as a classical optimisation problem. As is well-known, no optimal method exists besides an impractical exhaustive enumeration of all operators constructions. As a practical construction method, we shall choose the output I of 0, and 0, +, such that: Rn,,l.X,.N,-

C(O,+J

be maximal at each steps s. Construction will stop when a valid operator is reached. Memory

MinImum

with

cost

Mnmum

L--L Nb of steps

6. PARTICULAR

CHOICES

FOR THE

AMOUNT

OF REPRESENTATION

(I) Assuming N to be the set of integers from 0 to [N] - 1the most usual choice for M(X, N) is: is: M(X, N) = [Xl log2[Nl since log2[N] bits are required for each of the [X] values x. Actually, we must require: M(0, N) = M(N, N) = 0

ci.

138

(jtllH0

according to our previous assumptions, so that we shall define M(X, N) in the following way: x S [N]/2

M(X, N) = [XI log*[Nl

x 3 [??I/?

M(X, N) = [N -X] IO&IN].

The associated reductions should be such that the corresponding outputs be sets of integers from 0 to [Nil - 1. If they are not, we will take N, = N. With these particular choices, one can easily check that V8.C M(Xi, Ni 1s M(X NJ. iE,

Where equality is only reached when N, = N for all i. (2) This choice for M(X. N) does not correspond to a minimum of representation as shown in 12-51.The minimum of representation can be reached by using an enumeration of the set of subsets of N that have [X] elements. There are

such subsets. so that log,

required to characterise any one of them. We shall accordingly choose: M(X,N)=log:!

1x1

(

,Nl

!

This function fulfils all necessary assumptions and the relation

holds if the associated reductions have the same properties as in (I). Actually this function is inconvenient for use on a computer. A good approximation is

M(X, N) = IX1log [NlllX]. This formula is the one (I), but for the -fXJ log [X] term, which can be interpeted as the information brought by the knowledge of the order of the values of N [2]. 7. RESULTS

Two examples where chosen: (1) X is the set of COBOL reserved words and N is the set of at most 21-Character strings composed with 38 different characters. (2) X is a set of 256 arithmetric expressions and N is a set of at most IO-character strings composed with 9 symbols.

Data compression producing structures-discoveries

139

On these two examples, two types of starting operators were chosen. (1) Length operators: strings lengths are compared to a given length and strings are coded accordingly. (2) Character operators: The ith character is compared to a given one, and when they are identical, the corresponding chain is truncated. The corresponding valid operators yields a memory representation real gain of the order of 65% (see Table 1). Acknowledgemenrs-The participation.

author would like to express his gratitude to C. LOYO, J. F. PARISand C. JABLONfor their helpful

REFERENCES [I] Z. MANNA, Formalisation o/Properties of Programs. Standford (July 1%8). [2] G. GUIHO,These d’Etat Universite Paris VI, (in French) (DCcembre 1973). [3] G. GUrHO,Gain thtorique de m&moire obtenu a I’aide d’un algorithme de prttraitement. C.R. Acad. Sci. Paris, t 276, (in French) (Janvier 1973). [4] G. GLIIHO,Mt!ihodes de recherche en mimoire ef t?ude d’optimisafion. Colloque International sur les Mtmoires, (in French), Paris (Octobre 1973).

Lihat lebih banyak...

Data compression producing structures—Discoveries

Descrição do Produto

Comentários