Input Formatting

Simple Mutation List Format

The smlf format is essentially a tab-separated values format with three columns. The columns represent:
  1. Protein identifier
  2. Amino acid change notation (optional)
  3. Tags (optional)

An example:
NP_000007.1     R29L    label:Pathogenic
NP_000007.1     Q45H    label:Pathogenic
NP_000007.1     R53C    label:Pathogenic
NP_000007.1     Y67H    label:Pathogenic
NP_000007.1     Y73C    label:Pathogenic
NP_000007.1     P74L    label:Pathogenic
NP_000007.1     I78V    label:Pathogenic
NP_000007.1     I78T    label:Pathogenic
NP_000007.1     I78M    label:Pathogenic
NP_000007.1     A81T    label:Pathogenic
NP_000007.1     L84F    label:Pathogenic
NP_000007.1     G85S    label:Pathogenic
NP_000007.1     G85R    label:Pathogenic
NP_000007.1     M87T    label:Pathogenic
NP_000007.1     M87I    label:Benign
NP_000007.1     D104N   label:Pathogenic
NP_000007.1     D104G   label:Pathogenic
NP_000007.1     L107F   label:Pathogenic
NP_000007.1     A113T   label:Pathogenic
NP_000007.1     A113D   label:Pathogenic
NP_000007.1     Y114C   label:Pathogenic
NP_000007.1     C116Y   label:Pathogenic
NP_000007.1     G118A   label:Pathogenic
NP_000007.1     P132H   label:Pathogenic
NP_000007.1     A136V   label:Pathogenic
NP_000007.1     R148K   label:Pathogenic
NP_000007.1     R148I   label:Pathogenic
NP_000007.1     M149I   label:Pathogenic
NP_000007.1     M155T   label:Pathogenic
NP_000007.1     Y158H   label:Pathogenic
NP_000007.1     C159W   label:Benign
NP_000007.1     A165T   label:Pathogenic
NP_000007.1     D168G   label:Pathogenic
NP_000007.1     I185T   label:Pathogenic
NP_000007.1     I185M   label:Pathogenic
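
As an illustration of the column layout described above, the following minimal sketch splits one line into its up to three columns. It is not StructMAn's own parser: the names SmlfRecord and parse_smlf_line are hypothetical, and the sketch assumes that the columns are separated by literal tab characters and that missing optional columns are simply left out or left empty.

```python
from typing import NamedTuple, Optional


class SmlfRecord(NamedTuple):
    """One line of an smlf file (hypothetical container for illustration)."""
    protein_id: str
    aa_change: Optional[str]
    tags: Optional[str]


def parse_smlf_line(line: str) -> SmlfRecord:
    """Split one tab-separated smlf line into its up to three columns."""
    fields = line.rstrip("\n").split("\t")
    if not fields[0].strip() or len(fields) > 3:
        raise ValueError(f"not a valid smlf line: {line!r}")
    # Pad the optional columns so that unpacking always yields three values.
    fields += [""] * (3 - len(fields))
    protein_id, aa_change, tags = (f.strip() for f in fields)
    # Assumption: an empty optional column is treated as absent.
    return SmlfRecord(protein_id, aa_change or None, tags or None)


record = parse_smlf_line("NP_000007.1\tR29L\tlabel:Pathogenic")
```

A line that contains only a protein identifier yields a record whose aa_change and tags fields are None.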

Fasta Format

StructMAn can process a slightly modified version of fasta-formatted files. The header of each sequence should contain a unique protein identifier without whitespace (preceded by the typical fasta > symbol):
>Protein_ABC123
Between the header and the sequence line, any number of mutation information lines may appear, each starting with a < symbol:
<A23G [Tags-String]

Note

These lines are not supported by the standard fasta format.

The sequence line works exactly as in the standard fasta format and simply provides an amino acid sequence in one-letter code:
MAGGHKLMRAARAFTP
A complete example:
>NUDT15_urn:mavedb:00000055-0-1
MTASAQPRGRRPGVGVGVVVTSCKHPRCVLLGKRKGSVGAGSFQLPGGHLEFGETWEECAQRETWEEAALHLKNVHFASVVNSFIEKENYHYVTILMKGEVDVTHDSEPKNVEPEKNESWEWVPWEELPPLDQLFWGLRCLKEQGYDPFKEDLNHLVGYKGNHL
>RHO_urn:mavedb:00000099-a-1
MNGTEGPNFYVPFSNATGVVRSPFEYPQYYLAEPWQFSMLAAYMFLLIVLGFPINFLTLYVTVQHKKLRTPLNYILLNLAVADLFMVLGGFTSTLYTSLHGYFVFGPTGCNLEGFFATLGGEIALWSLVVLAIERYVVVCKPMSNFRFGENHAIMGVAFTWVMALACAAPPLAGWSRYIPEGLQCSCGIDYYTLKPEVNNESFVIYMFVVHFTIPMIIIFFCYGQLVFTVKEAAAQQQESATTQKAEKEVTRMVIIMVIAFLICWVPYASVAFYIFTHQGSNFGPIFMTIPAFFAKSAAIYNPVIYIMMNKQFRNCMLTTICCGKNPLGDDEASATVSKTETSQVAPA
>TEM-1_beta-lactamase_urn:mavedb:00000070-a-3
MSIQHFRVALIPFFAAFCLPVFAHPETLVKVKDAEDQLGARVGYIELDLNSGKILESFRPEERFPMMSTFKVLLCGAVLSRVDAGQEQLGRRIHYSQNDLVEYSPVTEKHLTDGMTVRELCSAAITMSDNTAANLLLTTIGGPKELTAFLHNMGDHVTRLDRWEPELNEAIPNDERDTTMPAAMATTLRKLLTGELLTLASRQQLIDWMEADKVAGPLLRSALPAGWFIADKSGAGERGSRGIIAALGPDGKPSRIVVIYTTGSQATMDERNRQIAEIGASLIKHW
>CcdB_urn:mavedb:00000084-a
MQFKVYTYKRESRYRLFVDVQSDIIDTPGRRMVIPLASARLLSDKVSRELYPVVHIGDESWRMMTTDMASVPVSVIGEEVADLSHRENDIKNAINLMFWGI
>PTEN_urn:mavedb:00000054-a
MTAIIKEIVSRNKRRYQEDGFDLDLTYIYPNIIAMGFPAERLEGVYRNNIDDVVRFLDSKHKNHYKIYNLCAERHYDTAKFNCRVAQYPFEDHNPPQLELIKPFCEDLDQWLSEDDNHVAAIHCKAGKGRTGVMICAYLLHRGKFLKAQEALDFYGEVRTRDKKGVTIPSQRRYVYYYSYLLKNHLDYRPVALLFHKMMFETIPMFSGGTCNPQFVVCQLKVKIYSSNSGPTRREDKFMYFEFPQPLPVCGDIKVEFFHKQNKMLKKDKMFHFWVNTFFIPGPEETSEKVENGSLCDQEIDSICSIERADNDKEYLVLTLTKNDLDKANKDKANRYFSPNFKVKLYFTKTVEEPSNPEASSSTSVTPDVSDNEPDHYRYSDTTDSDPENEPFDEDQHTQITKV
>p53_urn:mavedb:00000059-a
MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLSPDDIEQWFTEDPGPDEAPRMPEAAPPVAPAPAAPTPAAPAPAPSWPLSSSVPSQKTYQGSYGFRLGFLHSGTAKSVTCTYSPALNKMFCQLAKTCPVQLWVDSTPPPGTRVRAMAIYKQSQHMTEVVRRCPHHERCSDSDGLAPPQHLIRVEGNLRVEYLDDRNTFRHSVVVPYEPPEVGSDCTTIHYNYMCNSSCMGGMNRRPILTIITLEDSSGNLLGRNSFEVRVCACPGRDRRTEEENLRKKGEPHHELPPGSTKRALPNNTSSSPQPKKKPLDGEYFTLQIRGRERFEMFRELNEALELKDAQAGKEPGGSRAHSSHLKSKKGQSTSRHKKLMFKTEGPDSD
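
The three line types described above (header, mutation information, sequence) can be told apart by their first character. The sketch below shows one way to read such a file. It is not StructMAn's implementation: FastaEntry and parse_modified_fasta are hypothetical names, and the sketch assumes that the tags string is separated from the amino acid change by a space and that a sequence may be wrapped over several lines.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Optional, Tuple


@dataclass
class FastaEntry:
    """One protein with its optional mutation lines (hypothetical container)."""
    protein_id: str
    mutations: List[Tuple[str, Optional[str]]] = field(default_factory=list)
    sequence: str = ""


def parse_modified_fasta(lines: Iterable[str]) -> List[FastaEntry]:
    """Group '>' headers, '<' mutation lines and sequence lines into entries."""
    entries: List[FastaEntry] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            entries.append(FastaEntry(protein_id=line[1:]))
        elif not entries:
            raise ValueError("content found before the first '>' header")
        elif line.startswith("<"):
            # Assumption: the first space separates the change from the tags.
            change, _, tags = line[1:].partition(" ")
            entries[-1].mutations.append((change, tags.strip() or None))
        else:
            # Assumption: wrapped sequence lines are concatenated.
            entries[-1].sequence += line
    return entries


example = [">Protein_ABC123", "<A23G label:Benign", "MAGGHKLM", "RAARAFTP"]
entries = parse_modified_fasta(example)
```

Because mutation lines are optional, a plain fasta file such as the complete example above passes through the same function and produces entries with empty mutation lists.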

Inputs dedicated to further processing with StructGuy

In general, all StructMAn-readable formats can be used with StructGuy for prediction. For the featurization of a training dataset, an effect value needs to be assigned to each mutation, which can be done with the following formats:

Simple Mutation List Format for StructGuy Training

The effect values are assigned using the tags column with the general form: #effect_name:float. The effect_name can be any plain text string, but should be consistent throughout the file. float needs to be a floating-point number.
An example:
A0A140D2T1  I291A  #effect:0.0302686862600679
A0A140D2T1  I291Y  #effect:0.048604163054757
A0A140D2T1  I291W  #effect:0.0936416537015749
A0A140D2T1  I291V  #effect:0.626746537538822
A0A140D2T1  I291T  #effect:1.76206628831371
A0A140D2T1  I291S  #effect:0.017235384036755
A0A140D2T1  I291R  #effect:0.008824236279643
A0A140D2T1  I291Q  #effect:0.064509419785469
A0A140D2T1  I291P  #effect:0.006903087964492
A0A140D2T1  I291C  #effect:0.050460709116445
A0A140D2T1  I291M  #effect:1.55445997909363
A0A140D2T1  I291D  #effect:0.067708005015844
A0A140D2T1  I291E  #effect:0.051098705641169
A0A140D2T1  I291F  #effect:0.048359762811103
A0A140D2T1  I291N  #effect:0.047549277753243
A0A140D2T1  I291H  #effect:0.018586107041444
A0A140D2T1  I291K  #effect:0.021642108328451
A0A140D2T1  I291L  #effect:0.024099011187566
A0A140D2T1  I291G  #effect:0.023833827356767
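
To illustrate the #effect_name:float form, the following sketch splits a tag into its name and numeric value and collects the effect names used in a set of tags. The function names parse_effect_tag and effect_names are hypothetical, and the sketch assumes that the last colon in the tag separates the name from the value.

```python
from typing import Iterable, Set, Tuple


def parse_effect_tag(tag: str) -> Tuple[str, float]:
    """Split a tag of the form '#effect_name:float' into name and value."""
    tag = tag.strip()
    if not tag.startswith("#"):
        raise ValueError(f"effect tag must start with '#': {tag!r}")
    # Assumption: the last colon separates the name from the number.
    name, sep, value = tag[1:].rpartition(":")
    if not sep or not name:
        raise ValueError(f"effect tag must look like '#name:float': {tag!r}")
    return name, float(value)  # float() raises ValueError for non-numbers


def effect_names(tags: Iterable[str]) -> Set[str]:
    """Collect the distinct effect names that occur in the given tags."""
    return {parse_effect_tag(tag)[0] for tag in tags}


name, value = parse_effect_tag("#effect:0.0302686862600679")
```

For a file that names its effect consistently, effect_names returns a set with exactly one element, which gives a simple check before training.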

Fasta Format for StructGuy Training

The effect values are assigned using the special < lines with the general form: <[AA1][POS][AA2] #effect_name:float. AA1 is the wildtype amino acid, POS is the position number of the amino acid, and AA2 is the mutant amino acid. The effect_name can be any plain text string, but should be consistent throughout the file. float needs to be a floating-point number.
An example:
>A0A140D2T1_ZIKV_Sourisseau_2019
<I291A #effect:0.0302686862600679
<I291Y #effect:0.048604163054757
<I291W #effect:0.0936416537015749
<I291V #effect:0.626746537538822
<I291T #effect:1.76206628831371
<I291S #effect:0.017235384036755
<I291R #effect:0.008824236279643
<I291Q #effect:0.064509419785469
<I291P #effect:0.006903087964492
<I291C #effect:0.050460709116445
<I291M #effect:1.55445997909363
<I291D #effect:0.067708005015844
<I291E #effect:0.051098705641169
<I291F #effect:0.048359762811103
<I291N #effect:0.047549277753243
<I291H #effect:0.018586107041444
<I291K #effect:0.021642108328451
<I291L #effect:0.024099011187566
MKNPKKKSGGFRIVNMLKRGVARVNPLGGLKRLPAGLLLGHGPIRMVLAILAFLRFTAIKPSLGLINRWGSVGKKEAMEIIKKFKKDLAAMLRIINARKERKRRGADTSIGIIGLLLTTAMAAEITRRGSAYYMYLDRSD
AGKAISFATTLGVNKCHVQIMDLGHMCDATMSYECPMLDEGVEPDDVDCWCNTTSTWVVYGTCHHKKGEARRSRRAVTLPSHSTRKLQTRSQTWLESREYTKHLIKVENWIFRNPGFALVAVAIAWLLGSSTSQKVIYLV
MILLIAPAYSIRCIGVSNRDFVEGMSGGTWVDVVLEHGGCVTVMAQDKPTVDIELVTTTVSNMAEVRSYCYEASISDMASDSRCPTQGEAYLDKQSDTQYVCKRTLVDRGWGNGCGLFGKGSLVTCAKFTCSKKMTGKSI
QPENLEYRIMLSVHGSQHSGMIVNDTGYETDENRAKVEVTPNSPRAEATLGGFGSLGLDCEPRTGLDFSDLYYLTMNNKHWLVHKEWFHDIPLPWHAGADTGTPHWNNKEALVEFKDAHAKRQTVVVLGSQEGAVHTALA
GALEAEMDGAKGKLFSGHLKCRLKMDKLRLKGVSYSLCTAAFTFTKVPAETLHGTVTVEVQYAGTDGPCKIPVQMAVDMQTLTPVGRLITANPVITESTENSKMMLELDPPFGDSYIVIGVGDKKITHHWHRSGSTIGKA
FEATVRGAKRMAVLGDTAWDFGSVGGVFNSLGKGIHQIFGAAFKSLFGGMSWFSQILIGTLLVWLGLNTKNGSISLTCLALGGVMIFLSTAVSADVGCSVDFSKKETRCGTGVFIYNDVEAWRDRYKYHPDSPRRLAAAV
KQAWEEGICGISSVSRMENIMWKSVEGELNAILEENGVQLTVVVGSVKNPMWRGPQRLPVPVNELPHGWKAWGKSYFVRAAKTNNSFVVDGDTLKECPLEHRAWNSFLVEDHGFGVFHTSVWLKVREDYSLECDPAVIGT
AVKGREAAHSDLGYWIESEKNDTWRLKRAHLIEMKTCEWPKSHTLWTDGVEESDLIIPKSLAGPLSHHNTREGYRTQVKGPWHSEELEIRFEECPGTKVYVEETCGTRGPSLRSTTASGRVIEEWCCRECTMPPLSFRAK
DGCWYGMEIRPRKEPESNLVRSMVTAGSTDHMDHFSLGVLVILLMVQEGLKKRMTTKIIMSTSMAVLVVMILGGFSMSDLAKLVILMGATFAEMNTGGDVAHLALVAAFKVRPALLVSFIFRANWTPRESMLLALASCLL
QTAISALEGDLMVLINGFALAWLAIRAMAVPRTDNIALPILAALTPLARGTLLVAWRAGLATCGGIMLLSLKGKGSVKKNLPFVMALGLTAVRVVDPINVVGLLLLTRSGKRSWPPSEVLTAVGLICALAGGFAKADIEM
AGPMAAVGLLIVSYVVSGKSVDMYIERAGDITWEKDAEVTGNSPRLDVALDESGDFSLVEEDGPPMREIILKVVLMAICGMNPIAIPFAAGAWYVYVKTGKRSGALWDVPAPKEVKKGETTDGVYRVMTRRLLGSTQVGV
GVMQEGVFHTMWHVTKGAALRSGEGRLDPYWGDVKQDLVSYCGPWKLDAAWDGLSEVQLLAVPPGERARNIQTLPGIFKTKDGDIGAVALDYPAGTSGSPILDKCGRVIGLYGNGVVIKNGSYVSAITQGKREEETPVEC
FEPSMLKKKQLTVLDLHPGAGKTRRVLPEIVREAIKKRLRTVILAPTRVVAAEMEEALRGLPVRYMTTAVNVTHSGTEIVDLMCHATFTSRLLQPIRVPNYNLYIMDEAHFTDPSSIAARGYISTRVEMGEAAAIFMTAT
PPGTRDAFPDSNSPIMDTEVEVPERAWSSGFDWVTDHSGKTVWFVPSVRNGNEIAACLTKAGKRVIQLSRKTFETEFQKTKNQEWDFVITTDISEMGANFKADRVIDSRRCLKPVILDGERVILAGPMPVTHASAAQRRG
RIGRNPNKPGDEYMYGGGCAETDEGHAHWLEARMLLDNIYLQDGLIASLYRPEADKVAAIEGEFKLRTEQRKTFVELMKRGDLPVWLAYQVASAGITYTDRRWCFDGTTNNTIMEDSVPAEVWTKYGEKRVLKPRWMDAR
VCSDHAALKSFKEFAAGKRGAALGVMEALGTLPGHMTERFQEAIDNLAVLMRAETGSRPYKAAAAQLPETLETIMLLGLLGTVSLGIFFVLMRNKGIGKMGFGMVTLGASAWLMWLSEIEPARIACVLIVVFLLLVVLIP
EPEKQRSPQDNQMAIIIMVAVGLLGLITANELGWLERTKNDIAHLMGRREEGATMGFSMDIDLRPASAWAIYAALTTLITPAVQHAVTTSYNNYSLMAMATQAGVLFGMGKGMPFYAWDLGVPLLMMGCYSQLTPLTLIV
AIILLVAHYMYLIPGLQAAAARAAQKRTAAGIMKNPVVDGIVVTDIDTMTIDPQVEKKMGQVLLIAVAISSAVLLRTAWGWGEAGALITAATSTLWEGSPNKYWNSSTATSLCNIFRGSYLAGASLIYTVTRNAGLVKRR
GGGTGETLGEKWKARLNQMSALEFYSYKKSGITEVCREEARRALKDGVATGGHAVSRGSAKLRWLVERGYLQPYGKVVDLGCGRGGWSYYAATIRKVQEVRGYTKGGPGHEEPMLVQSYGWNIVRLKSGVDVFHMAAEPC
DTLLCDIGESSSSPEVEETRTLRVLSMVGDWLEKRPGAFCIKVLCPYTSTMMETMERLQRRHGGGLVRVPLSRNSTHEMYWVSGAKSNIIKSVSTTSQLLLGRMDGPRRPVKYEEDVNLGSGTRAVASCAEAPNMKIIGR
RIERIRNEHAETWFLDENHPYRTWAYHGSYEAPTQGSASSLVNGVVRLLSKPWDVVTGVTGIAMTDTTPYGQQRVFKEKVDTRVPDPQEGTRQVMNIVSSWLWKELGKRKRPRVCTKEEFINKVRSNAALGAIFEEEKEW
KTAVEAVNDPRFWALVDREREHHLRGECHSCVYNMMGKREKKQGEFGKAKGSRAIWYMWLGARFLEFEALGFLNEDHWMGRENSGGGVEGLGLQRLGYILEEMNRAPGGKMYADDTAGWDTRISKFDLENEALITNQMEE
GHRTLALAVIKYTYQNKVVKVLRPAEGGKTVMDIISRQDQRGSGQVVTYALNTFTNLVVQLIRNMEAEEVLEMQDLWLLRKPEKVTRWLQSNGWDRLKRMAVSGDDCVVKPIDDRFAHALRFLNDMGKVRKDTQEWKPST
GWSNWEEVPFCSHHFNKLYLKDGRSIVVPCRHQDELIGRARVSPGAGWSIRETACLAKSYAQMWQLLYFHRRDLRLMANAICSAVPVDWVPTGRTTWSIHGKGEWMTTEDMLMVWNRVWIEENDHMEDKTPVTKWTDIPY
LGKREDLWCGSLIGHRPRTTWAENIKDTVNMVRRIIGDEEKYMDYLSTQVRYLGEEGSTPGVL
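
The < line form described for this format can be decomposed with a regular expression, as in the sketch below. This is only an illustration, not StructGuy's parser: EffectMutation, parse_effect_line and wildtype_matches are hypothetical names, amino acids are assumed to be single uppercase letters, and positions are assumed to be 1-based.

```python
import re
from typing import NamedTuple

# <[AA1][POS][AA2] #effect_name:float
MUTATION_LINE = re.compile(r"^<([A-Z])(\d+)([A-Z])\s+#(.+):(\S+)$")


class EffectMutation(NamedTuple):
    """One '<' line with an effect value (hypothetical container)."""
    wildtype: str
    position: int
    mutant: str
    effect_name: str
    effect_value: float


def parse_effect_line(line: str) -> EffectMutation:
    """Parse a line such as '<I291A #effect:0.03' into its components."""
    match = MUTATION_LINE.match(line.strip())
    if match is None:
        raise ValueError(f"not a valid effect line: {line!r}")
    wildtype, position, mutant, name, value = match.groups()
    return EffectMutation(wildtype, int(position), mutant, name, float(value))


def wildtype_matches(sequence: str, mutation: EffectMutation) -> bool:
    """Check that the sequence carries the wildtype residue at the position.

    Assumption: positions count from 1, so position 1 is sequence[0].
    """
    index = mutation.position - 1
    return 0 <= index < len(sequence) and sequence[index] == mutation.wildtype


mutation = parse_effect_line("<I291A #effect:0.0302686862600679")
```

Checking the wildtype residue against the sequence with wildtype_matches is an optional sanity check that can catch numbering offsets between the mutation list and the sequence.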