Input Formatting
Simple Mutation List Format
The smlf format is essentially a tab-separated values format with three columns. The columns represent:
Protein identifier
Amino acid change notation (optional)
Tags (optional)
An example:
NP_000007.1 R29L label:Pathogenic
NP_000007.1 Q45H label:Pathogenic
NP_000007.1 R53C label:Pathogenic
NP_000007.1 Y67H label:Pathogenic
NP_000007.1 Y73C label:Pathogenic
NP_000007.1 P74L label:Pathogenic
NP_000007.1 I78V label:Pathogenic
NP_000007.1 I78T label:Pathogenic
NP_000007.1 I78M label:Pathogenic
NP_000007.1 A81T label:Pathogenic
NP_000007.1 L84F label:Pathogenic
NP_000007.1 G85S label:Pathogenic
NP_000007.1 G85R label:Pathogenic
NP_000007.1 M87T label:Pathogenic
NP_000007.1 M87I label:Benign
NP_000007.1 D104N label:Pathogenic
NP_000007.1 D104G label:Pathogenic
NP_000007.1 L107F label:Pathogenic
NP_000007.1 A113T label:Pathogenic
NP_000007.1 A113D label:Pathogenic
NP_000007.1 Y114C label:Pathogenic
NP_000007.1 C116Y label:Pathogenic
NP_000007.1 G118A label:Pathogenic
NP_000007.1 P132H label:Pathogenic
NP_000007.1 A136V label:Pathogenic
NP_000007.1 R148K label:Pathogenic
NP_000007.1 R148I label:Pathogenic
NP_000007.1 M149I label:Pathogenic
NP_000007.1 M155T label:Pathogenic
NP_000007.1 Y158H label:Pathogenic
NP_000007.1 C159W label:Benign
NP_000007.1 A165T label:Pathogenic
NP_000007.1 D168G label:Pathogenic
NP_000007.1 I185T label:Pathogenic
NP_000007.1 I185M label:Pathogenic
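As a rough illustration of the three-column layout described above, the following Python sketch splits a single line into its fields. It is not part of StructMAn: the names SmlfRecord and parse_smlf_line are hypothetical, and the sketch assumes that an omitted optional column is either absent or empty.

```python
from typing import NamedTuple, Optional


class SmlfRecord(NamedTuple):
    """One line of a simple mutation list file (hypothetical helper type)."""
    protein_id: str
    aa_change: Optional[str]
    tags: Optional[str]


def parse_smlf_line(line: str) -> SmlfRecord:
    """Split one tab-separated line into protein id, amino acid change, and tags."""
    fields = line.rstrip("\r\n").split("\t")
    if not fields[0].strip():
        raise ValueError("missing protein identifier")
    # Pad with empty strings so that the optional columns may be left out.
    fields += [""] * (3 - len(fields))
    protein_id, aa_change, tags = (f.strip() for f in fields[:3])
    # Assumption: an empty optional column is treated as "not given".
    return SmlfRecord(protein_id, aa_change or None, tags or None)


record = parse_smlf_line("NP_000007.1\tR29L\tlabel:Pathogenic\n")
print(record.protein_id, record.aa_change, record.tags)
```

Returning None for missing columns keeps the distinction between "no mutation given" and an empty string explicit for later processing steps.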
Fasta Format
StructMAn can process a slightly modified version of fasta-formatted files. The header of each sequence should contain a unique protein identifier without whitespace (preceded by the typical fasta >-symbol):
>Protein_ABC123
Any number of mutation information lines, each starting with a <-symbol, can be placed between the header and the sequence line.
Note
These lines are not part of the standard fasta format.
<A23G [Tags-String]
The sequence line works identically to the standard fasta format: it simply provides an amino acid sequence in one-letter code:
MAGGHKLMRAARAFTP
A complete example:
>NUDT15_urn:mavedb:00000055-0-1
MTASAQPRGRRPGVGVGVVVTSCKHPRCVLLGKRKGSVGAGSFQLPGGHLEFGETWEECAQRETWEEAALHLKNVHFASVVNSFIEKENYHYVTILMKGEVDVTHDSEPKNVEPEKNESWEWVPWEELPPLDQLFWGLRCLKEQGYDPFKEDLNHLVGYKGNHL
>RHO_urn:mavedb:00000099-a-1
MNGTEGPNFYVPFSNATGVVRSPFEYPQYYLAEPWQFSMLAAYMFLLIVLGFPINFLTLYVTVQHKKLRTPLNYILLNLAVADLFMVLGGFTSTLYTSLHGYFVFGPTGCNLEGFFATLGGEIALWSLVVLAIERYVVVCKPMSNFRFGENHAIMGVAFTWVMALACAAPPLAGWSRYIPEGLQCSCGIDYYTLKPEVNNESFVIYMFVVHFTIPMIIIFFCYGQLVFTVKEAAAQQQESATTQKAEKEVTRMVIIMVIAFLICWVPYASVAFYIFTHQGSNFGPIFMTIPAFFAKSAAIYNPVIYIMMNKQFRNCMLTTICCGKNPLGDDEASATVSKTETSQVAPA
>TEM-1_beta-lactamase_urn:mavedb:00000070-a-3
MSIQHFRVALIPFFAAFCLPVFAHPETLVKVKDAEDQLGARVGYIELDLNSGKILESFRPEERFPMMSTFKVLLCGAVLSRVDAGQEQLGRRIHYSQNDLVEYSPVTEKHLTDGMTVRELCSAAITMSDNTAANLLLTTIGGPKELTAFLHNMGDHVTRLDRWEPELNEAIPNDERDTTMPAAMATTLRKLLTGELLTLASRQQLIDWMEADKVAGPLLRSALPAGWFIADKSGAGERGSRGIIAALGPDGKPSRIVVIYTTGSQATMDERNRQIAEIGASLIKHW
>CcdB_urn:mavedb:00000084-a
MQFKVYTYKRESRYRLFVDVQSDIIDTPGRRMVIPLASARLLSDKVSRELYPVVHIGDESWRMMTTDMASVPVSVIGEEVADLSHRENDIKNAINLMFWGI
>PTEN_urn:mavedb:00000054-a
MTAIIKEIVSRNKRRYQEDGFDLDLTYIYPNIIAMGFPAERLEGVYRNNIDDVVRFLDSKHKNHYKIYNLCAERHYDTAKFNCRVAQYPFEDHNPPQLELIKPFCEDLDQWLSEDDNHVAAIHCKAGKGRTGVMICAYLLHRGKFLKAQEALDFYGEVRTRDKKGVTIPSQRRYVYYYSYLLKNHLDYRPVALLFHKMMFETIPMFSGGTCNPQFVVCQLKVKIYSSNSGPTRREDKFMYFEFPQPLPVCGDIKVEFFHKQNKMLKKDKMFHFWVNTFFIPGPEETSEKVENGSLCDQEIDSICSIERADNDKEYLVLTLTKNDLDKANKDKANRYFSPNFKVKLYFTKTVEEPSNPEASSSTSVTPDVSDNEPDHYRYSDTTDSDPENEPFDEDQHTQITKV
>p53_urn:mavedb:00000059-a
MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLSPDDIEQWFTEDPGPDEAPRMPEAAPPVAPAPAAPTPAAPAPAPSWPLSSSVPSQKTYQGSYGFRLGFLHSGTAKSVTCTYSPALNKMFCQLAKTCPVQLWVDSTPPPGTRVRAMAIYKQSQHMTEVVRRCPHHERCSDSDGLAPPQHLIRVEGNLRVEYLDDRNTFRHSVVVPYEPPEVGSDCTTIHYNYMCNSSCMGGMNRRPILTIITLEDSSGNLLGRNSFEVRVCACPGRDRRTEEENLRKKGEPHHELPPGSTKRALPNNTSSSPQPKKKPLDGEYFTLQIRGRERFEMFRELNEALELKDAQAGKEPGGSRAHSSHLKSKKGQSTSRHKKLMFKTEGPDSD
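To make the structure of the modified fasta format more concrete, here is a minimal Python sketch of a reader that collects, per header, the optional <-lines and the sequence. It is only an illustration and not StructMAn's own parser: the function name parse_structman_fasta and the returned dictionary layout are assumptions, as is the choice to concatenate sequences that are wrapped over several lines.

```python
from typing import Dict, List, TypedDict


class FastaEntry(TypedDict):
    mutations: List[str]
    sequence: str


def parse_structman_fasta(text: str) -> Dict[str, FastaEntry]:
    """Group mutation lines and sequence lines under their header identifier."""
    records: Dict[str, FastaEntry] = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            current = line[1:]
            records[current] = {"mutations": [], "sequence": ""}
        elif current is None:
            raise ValueError("content found before the first header line")
        elif line.startswith("<"):
            # Mutation information line; keep its content without the marker.
            records[current]["mutations"].append(line[1:])
        else:
            # Sequence line; wrapped sequences are joined (an assumption).
            records[current]["sequence"] += line
    return records


# Toy input: the mutation A2G is hypothetical and chosen to fit the sequence.
example = ">Protein_ABC123\n<A2G\nMAGGHKLM\nRAARAFTP\n"
for protein_id, entry in parse_structman_fasta(example).items():
    print(protein_id, entry["mutations"], entry["sequence"])
```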
Inputs dedicated for further processing with StructGuy
Generally, all StructMAn-readable formats can be used with StructGuy for prediction. For the featurization of a training dataset, an effect value needs to be assigned to each mutation, which can be done with the following formats:
Simple Mutation List Format for StructGuy Training
The effect values are assigned using the tags column with the general form: #effect_name:float. The effect_name can be any normal text-string, but should be consistent throughout the file. float needs to be a floating point number.
An example:
A0A140D2T1 I291A #effect:0.0302686862600679
A0A140D2T1 I291Y #effect:0.048604163054757
A0A140D2T1 I291W #effect:0.0936416537015749
A0A140D2T1 I291V #effect:0.626746537538822
A0A140D2T1 I291T #effect:1.76206628831371
A0A140D2T1 I291S #effect:0.017235384036755
A0A140D2T1 I291R #effect:0.008824236279643
A0A140D2T1 I291Q #effect:0.064509419785469
A0A140D2T1 I291P #effect:0.006903087964492
A0A140D2T1 I291C #effect:0.050460709116445
A0A140D2T1 I291M #effect:1.55445997909363
A0A140D2T1 I291D #effect:0.067708005015844
A0A140D2T1 I291E #effect:0.051098705641169
A0A140D2T1 I291F #effect:0.048359762811103
A0A140D2T1 I291N #effect:0.047549277753243
A0A140D2T1 I291H #effect:0.018586107041444
A0A140D2T1 I291K #effect:0.021642108328451
A0A140D2T1 I291L #effect:0.024099011187566
A0A140D2T1 I291G #effect:0.023833827356767
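The tag form #effect_name:float can be checked with a small helper before a file is passed on for training. The Python sketch below is an illustration only: parse_effect_tag is a hypothetical name, and it assumes that the tags column holds a single tag and that effect_name contains neither whitespace nor a colon.

```python
import re
from typing import Tuple

# Assumption: the effect name has no whitespace and no colon.
EFFECT_TAG = re.compile(r"#([^:\s]+):(\S+)")


def parse_effect_tag(tag: str) -> Tuple[str, float]:
    """Split a tag such as '#effect:0.03' into its name and numeric value."""
    match = EFFECT_TAG.fullmatch(tag.strip())
    if match is None:
        raise ValueError(f"not an effect tag: {tag!r}")
    name, value = match.groups()
    # float() raises ValueError if the value is not a floating point number.
    return name, float(value)


name, value = parse_effect_tag("#effect:0.0302686862600679")
print(name, value)
```

Collecting the returned names over a whole file and checking that only one distinct name occurs is a simple way to verify that the effect name is used consistently.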
Fasta Format for StructGuy Training
The effect values are assigned using the special < lines with the general form: <[AA1][POS][AA2] #effect_name:float. AA1 is the wildtype amino acid, POS is the position number of the amino acid, and AA2 is the mutant amino acid. The effect_name can be any normal text-string, but should be consistent throughout the file. float needs to be a floating point number.
An example:
>A0A140D2T1_ZIKV_Sourisseau_2019
<I291A #effect:0.0302686862600679
<I291Y #effect:0.048604163054757
<I291W #effect:0.0936416537015749
<I291V #effect:0.626746537538822
<I291T #effect:1.76206628831371
<I291S #effect:0.017235384036755
<I291R #effect:0.008824236279643
<I291Q #effect:0.064509419785469
<I291P #effect:0.006903087964492
<I291C #effect:0.050460709116445
<I291M #effect:1.55445997909363
<I291D #effect:0.067708005015844
<I291E #effect:0.051098705641169
<I291F #effect:0.048359762811103
<I291N #effect:0.047549277753243
<I291H #effect:0.018586107041444
<I291K #effect:0.021642108328451
<I291L #effect:0.024099011187566
MKNPKKKSGGFRIVNMLKRGVARVNPLGGLKRLPAGLLLGHGPIRMVLAILAFLRFTAIKPSLGLINRWGSVGKKEAMEIIKKFKKDLAAMLRIINARKERKRRGADTSIGIIGLLLTTAMAAEITRRGSAYYMYLDRSD
AGKAISFATTLGVNKCHVQIMDLGHMCDATMSYECPMLDEGVEPDDVDCWCNTTSTWVVYGTCHHKKGEARRSRRAVTLPSHSTRKLQTRSQTWLESREYTKHLIKVENWIFRNPGFALVAVAIAWLLGSSTSQKVIYLV
MILLIAPAYSIRCIGVSNRDFVEGMSGGTWVDVVLEHGGCVTVMAQDKPTVDIELVTTTVSNMAEVRSYCYEASISDMASDSRCPTQGEAYLDKQSDTQYVCKRTLVDRGWGNGCGLFGKGSLVTCAKFTCSKKMTGKSI
QPENLEYRIMLSVHGSQHSGMIVNDTGYETDENRAKVEVTPNSPRAEATLGGFGSLGLDCEPRTGLDFSDLYYLTMNNKHWLVHKEWFHDIPLPWHAGADTGTPHWNNKEALVEFKDAHAKRQTVVVLGSQEGAVHTALA
GALEAEMDGAKGKLFSGHLKCRLKMDKLRLKGVSYSLCTAAFTFTKVPAETLHGTVTVEVQYAGTDGPCKIPVQMAVDMQTLTPVGRLITANPVITESTENSKMMLELDPPFGDSYIVIGVGDKKITHHWHRSGSTIGKA
FEATVRGAKRMAVLGDTAWDFGSVGGVFNSLGKGIHQIFGAAFKSLFGGMSWFSQILIGTLLVWLGLNTKNGSISLTCLALGGVMIFLSTAVSADVGCSVDFSKKETRCGTGVFIYNDVEAWRDRYKYHPDSPRRLAAAV
KQAWEEGICGISSVSRMENIMWKSVEGELNAILEENGVQLTVVVGSVKNPMWRGPQRLPVPVNELPHGWKAWGKSYFVRAAKTNNSFVVDGDTLKECPLEHRAWNSFLVEDHGFGVFHTSVWLKVREDYSLECDPAVIGT
AVKGREAAHSDLGYWIESEKNDTWRLKRAHLIEMKTCEWPKSHTLWTDGVEESDLIIPKSLAGPLSHHNTREGYRTQVKGPWHSEELEIRFEECPGTKVYVEETCGTRGPSLRSTTASGRVIEEWCCRECTMPPLSFRAK
DGCWYGMEIRPRKEPESNLVRSMVTAGSTDHMDHFSLGVLVILLMVQEGLKKRMTTKIIMSTSMAVLVVMILGGFSMSDLAKLVILMGATFAEMNTGGDVAHLALVAAFKVRPALLVSFIFRANWTPRESMLLALASCLL
QTAISALEGDLMVLINGFALAWLAIRAMAVPRTDNIALPILAALTPLARGTLLVAWRAGLATCGGIMLLSLKGKGSVKKNLPFVMALGLTAVRVVDPINVVGLLLLTRSGKRSWPPSEVLTAVGLICALAGGFAKADIEM
AGPMAAVGLLIVSYVVSGKSVDMYIERAGDITWEKDAEVTGNSPRLDVALDESGDFSLVEEDGPPMREIILKVVLMAICGMNPIAIPFAAGAWYVYVKTGKRSGALWDVPAPKEVKKGETTDGVYRVMTRRLLGSTQVGV
GVMQEGVFHTMWHVTKGAALRSGEGRLDPYWGDVKQDLVSYCGPWKLDAAWDGLSEVQLLAVPPGERARNIQTLPGIFKTKDGDIGAVALDYPAGTSGSPILDKCGRVIGLYGNGVVIKNGSYVSAITQGKREEETPVEC
FEPSMLKKKQLTVLDLHPGAGKTRRVLPEIVREAIKKRLRTVILAPTRVVAAEMEEALRGLPVRYMTTAVNVTHSGTEIVDLMCHATFTSRLLQPIRVPNYNLYIMDEAHFTDPSSIAARGYISTRVEMGEAAAIFMTAT
PPGTRDAFPDSNSPIMDTEVEVPERAWSSGFDWVTDHSGKTVWFVPSVRNGNEIAACLTKAGKRVIQLSRKTFETEFQKTKNQEWDFVITTDISEMGANFKADRVIDSRRCLKPVILDGERVILAGPMPVTHASAAQRRG
RIGRNPNKPGDEYMYGGGCAETDEGHAHWLEARMLLDNIYLQDGLIASLYRPEADKVAAIEGEFKLRTEQRKTFVELMKRGDLPVWLAYQVASAGITYTDRRWCFDGTTNNTIMEDSVPAEVWTKYGEKRVLKPRWMDAR
VCSDHAALKSFKEFAAGKRGAALGVMEALGTLPGHMTERFQEAIDNLAVLMRAETGSRPYKAAAAQLPETLETIMLLGLLGTVSLGIFFVLMRNKGIGKMGFGMVTLGASAWLMWLSEIEPARIACVLIVVFLLLVVLIP
EPEKQRSPQDNQMAIIIMVAVGLLGLITANELGWLERTKNDIAHLMGRREEGATMGFSMDIDLRPASAWAIYAALTTLITPAVQHAVTTSYNNYSLMAMATQAGVLFGMGKGMPFYAWDLGVPLLMMGCYSQLTPLTLIV
AIILLVAHYMYLIPGLQAAAARAAQKRTAAGIMKNPVVDGIVVTDIDTMTIDPQVEKKMGQVLLIAVAISSAVLLRTAWGWGEAGALITAATSTLWEGSPNKYWNSSTATSLCNIFRGSYLAGASLIYTVTRNAGLVKRR
GGGTGETLGEKWKARLNQMSALEFYSYKKSGITEVCREEARRALKDGVATGGHAVSRGSAKLRWLVERGYLQPYGKVVDLGCGRGGWSYYAATIRKVQEVRGYTKGGPGHEEPMLVQSYGWNIVRLKSGVDVFHMAAEPC
DTLLCDIGESSSSPEVEETRTLRVLSMVGDWLEKRPGAFCIKVLCPYTSTMMETMERLQRRHGGGLVRVPLSRNSTHEMYWVSGAKSNIIKSVSTTSQLLLGRMDGPRRPVKYEEDVNLGSGTRAVASCAEAPNMKIIGR
RIERIRNEHAETWFLDENHPYRTWAYHGSYEAPTQGSASSLVNGVVRLLSKPWDVVTGVTGIAMTDTTPYGQQRVFKEKVDTRVPDPQEGTRQVMNIVSSWLWKELGKRKRPRVCTKEEFINKVRSNAALGAIFEEEKEW
KTAVEAVNDPRFWALVDREREHHLRGECHSCVYNMMGKREKKQGEFGKAKGSRAIWYMWLGARFLEFEALGFLNEDHWMGRENSGGGVEGLGLQRLGYILEEMNRAPGGKMYADDTAGWDTRISKFDLENEALITNQMEE
GHRTLALAVIKYTYQNKVVKVLRPAEGGKTVMDIISRQDQRGSGQVVTYALNTFTNLVVQLIRNMEAEEVLEMQDLWLLRKPEKVTRWLQSNGWDRLKRMAVSGDDCVVKPIDDRFAHALRFLNDMGKVRKDTQEWKPST
GWSNWEEVPFCSHHFNKLYLKDGRSIVVPCRHQDELIGRARVSPGAGWSIRETACLAKSYAQMWQLLYFHRRDLRLMANAICSAVPVDWVPTGRTTWSIHGKGEWMTTEDMLMVWNRVWIEENDHMEDKTPVTKWTDIPY
LGKREDLWCGSLIGHRPRTTWAENIKDTVNMVRRIIGDEEKYMDYLSTQVRYLGEEGSTPGVL
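A <-line of the form <[AA1][POS][AA2] #effect_name:float can be taken apart with a regular expression, and the wildtype residue can then be compared against the sequence. The following Python sketch shows one way to do this. The helper names are hypothetical, amino acids are assumed to be single uppercase letters, and positions are assumed to be 1-based; none of this is confirmed StructMAn or StructGuy behaviour.

```python
import re
from typing import Optional, Tuple

# <AA1 POS AA2, optionally followed by "#effect_name:float"
MUTATION_LINE = re.compile(r"<([A-Z])(\d+)([A-Z])(?:\s+#([^:\s]+):(\S+))?")


def parse_mutation_line(line: str) -> Tuple[str, int, str, Optional[Tuple[str, float]]]:
    """Return (wildtype, position, mutant, effect), where effect may be None."""
    match = MUTATION_LINE.fullmatch(line.strip())
    if match is None:
        raise ValueError(f"not a mutation line: {line!r}")
    wildtype, position, mutant, name, value = match.groups()
    effect = (name, float(value)) if name is not None else None
    return wildtype, int(position), mutant, effect


def wildtype_matches(sequence: str, wildtype: str, position: int) -> bool:
    """Check the wildtype residue against the sequence (1-based position assumed)."""
    return 1 <= position <= len(sequence) and sequence[position - 1] == wildtype


wt, pos, mut, effect = parse_mutation_line("<I291A #effect:0.0302686862600679")
print(wt, pos, mut, effect)
```

Running wildtype_matches for every parsed line against the concatenated sequence catches off-by-one numbering errors before featurization.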