Sequence alignment by word processor

May 30, 2017 | Author: Ross Boswell | Category: Biological Sciences, Sequence alignment, Chemical Sciences

TIBS 12 - July 1987, p. 279

Microfile

D. Ross Boswell

The alignment of protein and gene sequences for maximum homology is a problem difficult to attack with pencil and paper. The sequences often have insertions or deletions that need the insertion of gaps in one or more sequences to maintain the alignment. The insertion of a gap changes all the alignments distal to it, and it needs only a few such insertions to make nonsense of the written alignment and require lengthy transcription. Cut and slide methods with the sequences on paper slips are workable, but again lengthy and error-prone transcription is needed to take down the final result.

Some word processing systems available on microprocessors are capable of handling very long lines of text: 1000 characters or more, which encompasses most of the protein and many of the gene sequence segments that one might wish to align. Such a system has particular advantages. (1) The sequences can be captured downline from a sequence database on a large computer system. (2) Changes in the alignment made by the insertion or deletion of gaps are reflected instantly on the screen. (3) The resulting alignment is available for direct input into programs calculating degrees of similarity, and for printing without risk of transcription errors.

The word processing programs I have used for sequence alignment on IBM PC-compatible microprocessors are WORDSTAR and PCWRITE.

WORDSTAR (trade-name of MicroPro International) is a widely available word processing program. It is capable of handling lines longer than 1000 characters, provided word-wrap is disabled (^OW) to prevent wrap-around at the ends of lines that contain embedded blanks. Insertion or deletion of a gap is reflected reasonably quickly in the display on the screen, as is movement of the cursor to left or right that requires horizontal scrolling. However, the update of the display takes an appreciable time, leading to a strange weaving effect that can be distracting at times as lines are moved one after another. It offers the particular advantage that block column mode (enabled by ^KN) allows the alignment to be chopped into line-length pieces for printing, or allows such a 'dissected' alignment to be reassembled into long lines again. It is advisable to use non-document mode to prevent WORDSTAR setting the high-order bits of the characters as flags, and producing a file that is incompatible with other programs. In non-document mode, word-wrap disabled is the usual default.

PCWRITE is available as public-domain software: the copyright owner Quicksoft encourages the free copying and distribution of an edited version (capable of handling documents of 20 to 30 pages) with rudimentary manuals, and offers an improved version with full manuals to those users who register and pay a fee. It does all editing in RAM, and is much faster on functions such as search or search-and-replace than WORDSTAR, which must scan through a disk file. Screen updates are very fast, particularly when the microprocessor does not have an IBM colour display. This particular display generates snow when scrolled rapidly; PCWRITE has the facility of incorporating a delay to avoid that problem. Word-wrap can be disabled by turning off automatic reformat with the control line '<aG>-' at the beginning of the file. There does not appear (at least in the public-domain version) to be a 'block column mode' equivalent. It does, however, offer the very useful facility of defining 'bookmarks' in the file that can be scrolled to with a single keystroke, so that the effects on a single site of various changes made at remote sites can be rapidly tested.

Having tried both systems, I prefer PCWRITE for sequence alignment. The editing commands are rather more friendly, the process of editing is faster and more pleasant, and in place of the rather tedious chopping of the alignment I use a program written in PASCAL to read the alignment file, calculate identity fractions, and print the alignment in blocks.

My practice is to use the first 20 characters of each line for the name of the sequence, padded out with blanks. The sequence follows to the end of the line, without embedded blanks. There are two reasons for avoiding embedded blanks. Firstly, it doubles the amount of sequence that appears on screen at any given time without decreasing legibility. Secondly, PCWRITE treats '-' (used to indicate inserted gaps) as a word-delimiter, so that the 'word-forward' and


D. R. Boswell is in the Department of Haematological Medicine, University of Cambridge School of Clinical Medicine, Addenbrooke's Hospital, Hills Road, Cambridge CB2 2QL, UK.

[Figure 1: on-screen view of the aligned sequences; the sequence text is not recoverable from this copy.]

Fig. 1. A segment of an alignment of sequences, showing the amount of information visible on screen at a given time. Here, the line fourth from the top has been used to indicate regions of secondary structure.

© 1987, Elsevier Publications, Cambridge 0376-5067/87/$02.00


SERPINS v7.4

[Figure 2: printed block of the alignment; the sequence text is not recoverable from this copy. Row labels, top to bottom: Sequence Number, SecStruc, A1AT Human, A1AT Baboon, A1AT Mouse, Hepcof 2, AT3, C1INH, [illegible] ORF1, PAI, TBG, A1ACT Mouse, A1ACT Human, A2NP, OVCHCK, Gene Y Protein, Angio Rat, Angio Human, Z Protein.]

Fig. 2. A section of the blocked alignment as printed, corresponding to part of the on-screen display seen in Fig. 1.

Table of identity percentages
18 sequences aligned

 0  SecStruc
 1  A1AT Human
 2  A1AT Baboon
 3  A1AT Mouse
 4  Hepcof 2
 5  AT3
 6  C1INH
 7  [illegible] ORF1
 8  PAI
 9  TBG
10  A1ACT Mouse
11  A1ACT Human
12  A2NP
13  OVCHCK
14  Gene Y Protein
15  Angio Rat
16  Angio Human
17  Z Protein

Reference

1 Boswell, D. R. and Carrell, R. W. (1987) Rec. Adv. Clin. Immunol. 4, 1-17

Availability of programs

WORDSTAR is available from most microprocessor dealers. The UK retail price is about £80. PCWRITE is in the public domain, and copies can be found in many universities in the UK. Try your university microprocessor support group. Outside the UK, Eire, Australia, New Zealand and South Africa the distributor is Quicksoft, 219 First N. #224, Seattle, WA 98109, USA. The price of registration is

'word-backward' commands (^F and ^A) skip to the next gap.

I use the top line of an alignment as a descriptor, and the next three lines for numbering. The first of these measures the sequence in units of 100: the line is blank up to position 119, then has a '1' (for 100) at position 120, a '2' at position 220 and so forth. The second line is divided into units of 10 and is blank to position 29, then has a '1' (for 10) at position 30, a '2' at position 40, and so forth up to a '0' (for 100) at position 120. This pattern repeats to the end of the line. The third has a '0' at position 30 and every tenth position after that to complete the vertical numbering. (Positions here are screen columns: because the first 20 columns hold the sequence name, residue 10 falls at position 30 and residue 100 at position 120.) The 'hundreds' and 'tens' lines are particularly useful for finding a given position or scanning along the alignment, because with the cursor on these lines, 'word-forward' and 'word-backward' move 100 or 10 residues at a stroke. The sequences follow on line 4 and subsequent lines.

A snapshot of the screen display during the editing of an alignment of serpins (Ref. 1) is shown in Fig. 1. A section of the output from the program used for block printing is given in Fig. 2. The program also counts residue identities in the aligned sequences and produces a table of identity fractions as shown in Fig. 3. I have compiled and run differing versions with TURBO-PASCAL and with PRO-PASCAL. TURBO-PASCAL seems unwilling to handle text files with lines longer than 1023 characters, but I have not found that restricting, since the sequences I work with have a maximum length of about 600.

In summary, I find a laboratory microprocessor a most convenient tool for maintaining, adjusting, and printing alignments of families of homologous sequences.
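The file layout described in the text (a 20-character name field followed by the sequence, headed by 'hundreds', 'tens' and 'units' numbering lines) can be illustrated with a short sketch. This is written in Python for convenience and is not the author's PASCAL program; the function names `format_row` and `ruler_lines` and the example sequence are hypothetical. It assumes that the positions quoted in the text are 1-based screen columns, so that residue 10 sits at column 30 and residue 100 at column 120.

```python
NAME_WIDTH = 20  # columns reserved for the sequence name, as described in the text


def format_row(name: str, sequence: str, name_width: int = NAME_WIDTH) -> str:
    """Pad (or truncate) the name to name_width columns, then append the sequence."""
    return name[:name_width].ljust(name_width) + sequence


def ruler_lines(seq_len: int, name_width: int = NAME_WIDTH) -> list[str]:
    """Build the hundreds, tens and units numbering lines for seq_len residues.

    Read vertically, the three lines spell out the residue number at every
    tenth residue (for example 1/0/0 above residue 100).
    """
    width = name_width + seq_len
    hundreds = [" "] * width
    tens = [" "] * width
    units = [" "] * width
    for residue in range(10, seq_len + 1, 10):
        col = name_width + residue - 1  # 0-based index of the residue's column
        units[col] = "0"
        tens[col] = str((residue // 10) % 10)
        if residue % 100 == 0:
            # Only exact hundreds are marked, so this line stays blank up to
            # column 119 and the digits form widely spaced 'words'.
            hundreds[col] = str((residue // 100) % 10)
    return ["".join(hundreds), "".join(tens), "".join(units)]


if __name__ == "__main__":
    for line in ruler_lines(60):
        print(line)
    # Hypothetical sequence, used only to show the row layout.
    print(format_row("example", "ACDEF-GHIKL"))
```

Because every digit on the hundreds and tens lines is separated from the next by blanks, an editor that moves word by word lands on successive digits, which is the behaviour the text relies on for jumping 100 or 10 residues at a time.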

US$89 with a US$20 overseas shipping charge where applicable. Within these countries, v2.7 and later versions will be marketed by Sagesoft, NEI House, Regent Centre, Gosforth, Newcastle upon Tyne, NE3 3DS, UK. The price will be about £99.

I will be happy to make copies of the PCWRITE distribution disc, and source-language and compiled versions of the program that produced Figs 2 and 3 (suitable for the IBM PC and compatibles running PC DOS or MS DOS with or without TURBO-PASCAL) free of charge for any user from a non-profit organization who sends two 5.25 inch double-sided 48 tpi discs and return-addressed stamped packaging (or alternative payment for postage) to me at the address given at the foot of p. 279.
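The article does not list the program that produced Fig. 2. As a rough sketch of the block-printing step only, written in Python and not taken from the author's PASCAL source, long alignment lines can be cut into fixed-width pieces with the name repeated at the start of each piece. The function name `format_blocks`, the 60-column block width and the example rows are assumptions made for illustration.

```python
NAME_WIDTH = 20  # width of the name field, as described in the text


def format_blocks(rows, block_width=60, name_width=NAME_WIDTH):
    """Cut an alignment into printable blocks.

    rows is a list of (name, aligned_sequence) pairs. Each block holds the
    same block_width columns of every sequence, one sequence per line, and
    blocks are separated by an empty line.
    """
    if not rows:
        return ""
    length = max(len(seq) for _, seq in rows)
    blocks = []
    for start in range(0, length, block_width):
        lines = [
            f"{name[:name_width]:<{name_width}}{seq[start:start + block_width]}"
            for name, seq in rows
        ]
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks)


if __name__ == "__main__":
    # Hypothetical rows; '-' marks an inserted gap as in the article.
    demo = [("seq1", "ABCDEFGHIJ"), ("seq2", "AB-DEFG-IJ")]
    print(format_blocks(demo, block_width=4))
```

Running the reverse operation (joining the pieces of each sequence back into one long line) would correspond to reassembling a 'dissected' alignment.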

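Table of identity percentages for SERPINS v7.4

(Rows and columns are numbered by sequence, 0 to 17. A '?' marks a digit that is illegible in this copy.)

     1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
 0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
 1      92  61  27  28  22  19  2?  42  44  ?7  26  30  30  20  20  22
 2          61  27  28  21  18  26  42  42  36  27  30  30  20  20  22
 3              28  31  28  20  25  40  43  39  32  33  31  23  21  23
 4                  27  17  19  24  28  34  28  23  29  32  17  16  21
 5                      20  21  28  27  38  29  27  31  35  16  16  25
 6                          18  21  21  30  22  22  19  20  14  15  19
 7                              28  21  26  22  21  20  20  13  13  23
 8                                  25  28  28  27  27  28  19  19  20
 9                                      42  40  25  26  26  17  18  20
10                                          60  34  32  33  25  26  23
11                                              25  29  29  19  18  23
12                                                  27  26  18  18  26

Rows 13 to 16: the ten remaining values appear in this copy in the order 58 19 19 19 16 64 2? 27 16 15; their assignment to rows and columns cannot be recovered reliably.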

Fig. 3. The table of identity fractions (expressed as percentages) generated from the SERPIN alignment shown in part in Figs 1 and 2. Note that the 'Secondary Structure' is treated as a sequence, and that it has an identity of 0 with all of the real sequences, since only positions containing letters (rather than special characters) are counted.
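The counting rule in the caption can be sketched as follows. This is an illustrative Python version, not the author's PASCAL program. The caption says only that positions containing letters are counted; the sketch assumes that a position is counted when both sequences have a letter there, and that the percentage is taken over those positions. That choice of denominator, the function names and the example rows are assumptions.

```python
def identity_percent(a: str, b: str) -> int:
    """Percentage of identical residues between two aligned sequences.

    Only columns where both sequences hold a letter are compared, so gaps
    ('-') and other special characters never count as matches.
    """
    compared = 0
    identical = 0
    for x, y in zip(a, b):
        if x.isalpha() and y.isalpha():
            compared += 1
            if x.upper() == y.upper():
                identical += 1
    # A row with no letters (e.g. a secondary-structure line drawn with
    # special characters) scores 0 against everything.
    return round(100 * identical / compared) if compared else 0


def identity_table(rows):
    """Return {(i, j): percent} for every pair i < j of (name, sequence) rows."""
    table = {}
    for i in range(len(rows)):
        for j in range(i + 1, len(rows)):
            table[(i, j)] = identity_percent(rows[i][1], rows[j][1])
    return table


if __name__ == "__main__":
    # Hypothetical rows for illustration only.
    demo = [("SecStruc", "<<<---"), ("a", "ACD-EF"), ("b", "ACE-EF")]
    for pair, pct in identity_table(demo).items():
        print(pair, pct)
```

Only the upper triangle is stored, matching the layout of the printed table, where each pair of sequences appears once.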

Software Publishers: If you would like us to review your software for teaching or research, contact the Editor at our Cambridge address.
