Input Formatting
================

.. _smlf:

Simple Mutation List Format
---------------------------

| The smlf format is essentially a tab-separated values file with up to three columns. The columns represent:

#. Protein identifier
#. Amino acid change notation *Optional*
#. Tags *Optional*

| An example:

::

    NP_000007.1	R29L	label:Pathogenic
    NP_000007.1	Q45H	label:Pathogenic
    NP_000007.1	R53C	label:Pathogenic
    NP_000007.1	Y67H	label:Pathogenic
    NP_000007.1	Y73C	label:Pathogenic
    NP_000007.1	P74L	label:Pathogenic
    NP_000007.1	I78V	label:Pathogenic
    NP_000007.1	I78T	label:Pathogenic
    NP_000007.1	I78M	label:Pathogenic
    NP_000007.1	A81T	label:Pathogenic
    NP_000007.1	L84F	label:Pathogenic
    NP_000007.1	G85S	label:Pathogenic
    NP_000007.1	G85R	label:Pathogenic
    NP_000007.1	M87T	label:Pathogenic
    NP_000007.1	M87I	label:Benign
    NP_000007.1	D104N	label:Pathogenic
    NP_000007.1	D104G	label:Pathogenic
    NP_000007.1	L107F	label:Pathogenic
    NP_000007.1	A113T	label:Pathogenic
    NP_000007.1	A113D	label:Pathogenic
    NP_000007.1	Y114C	label:Pathogenic
    NP_000007.1	C116Y	label:Pathogenic
    NP_000007.1	G118A	label:Pathogenic
    NP_000007.1	P132H	label:Pathogenic
    NP_000007.1	A136V	label:Pathogenic
    NP_000007.1	R148K	label:Pathogenic
    NP_000007.1	R148I	label:Pathogenic
    NP_000007.1	M149I	label:Pathogenic
    NP_000007.1	M155T	label:Pathogenic
    NP_000007.1	Y158H	label:Pathogenic
    NP_000007.1	C159W	label:Benign
    NP_000007.1	A165T	label:Pathogenic
    NP_000007.1	D168G	label:Pathogenic
    NP_000007.1	I185T	label:Pathogenic
    NP_000007.1	I185M	label:Pathogenic

.. _fasta:

Fasta Format
------------

| StructMAn can process a slightly modified version of fasta-formatted files. The header of each sequence should contain a unique protein identifier without whitespace (preceded by the typical fasta ``>`` symbol):
| ``>Protein_ABC123``
| Between the header and the sequence line there can be any number of mutation information lines, each starting with a ``<`` symbol.

.. note::
    Those lines are not supported by the standard fasta format.

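
| The layout described above (a ``>`` header, optional ``<`` mutation lines, then the sequence) can be read with a few lines of Python. The following is only a minimal sketch for illustration and not StructMAn's own parser: the function name ``parse_modified_fasta`` and the returned dictionary layout are assumptions made for this example, and the sample record is made up.

```python
def parse_modified_fasta(text):
    """Parse fasta text that may contain '<' mutation lines.

    Hypothetical helper, not part of StructMAn. Returns a dict that maps
    each protein identifier to {'mutations': [...], 'sequence': '...'}.
    """
    records = {}
    current = None  # identifier of the record currently being filled
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # skip empty lines
        if line.startswith(">"):
            # Header line: everything after '>' is the protein identifier
            current = line[1:].strip()
            if not current:
                raise ValueError("empty protein identifier in header")
            records[current] = {"mutations": [], "sequence": ""}
        elif current is None:
            raise ValueError("content found before the first '>' header")
        elif line.startswith("<"):
            # Mutation information line, stored without the leading '<'
            records[current]["mutations"].append(line[1:].strip())
        else:
            # Sequence line; sequences wrapped over several lines are joined
            records[current]["sequence"] += line
    return records


# Made-up record for illustration only
example = ">Protein_ABC123\n<A2K\n<G3S\nMAGT\nKL\n"
parsed = parse_modified_fasta(example)
print(parsed["Protein_ABC123"]["mutations"])
print(parsed["Protein_ABC123"]["sequence"])
```

| Files without any ``<`` lines are handled the same way and simply yield an empty mutation list for each protein.
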
| An example:

::

    >NUDT15_urn:mavedb:00000055-0-1
    MTASAQPRGRRPGVGVGVVVTSCKHPRCVLLGKRKGSVGAGSFQLPGGHLEFGETWEECAQRETWEEAALHLKNVHFASVVNSFIEKENYHYVTILMKGEVDVTHDSEPKNVEPEKNESWEWVPWEELPPLDQLFWGLRCLKEQGYDPFKEDLNHLVGYKGNHL
    >RHO_urn:mavedb:00000099-a-1
    MNGTEGPNFYVPFSNATGVVRSPFEYPQYYLAEPWQFSMLAAYMFLLIVLGFPINFLTLYVTVQHKKLRTPLNYILLNLAVADLFMVLGGFTSTLYTSLHGYFVFGPTGCNLEGFFATLGGEIALWSLVVLAIERYVVVCKPMSNFRFGENHAIMGVAFTWVMALACAAPPLAGWSRYIPEGLQCSCGIDYYTLKPEVNNESFVIYMFVVHFTIPMIIIFFCYGQLVFTVKEAAAQQQESATTQKAEKEVTRMVIIMVIAFLICWVPYASVAFYIFTHQGSNFGPIFMTIPAFFAKSAAIYNPVIYIMMNKQFRNCMLTTICCGKNPLGDDEASATVSKTETSQVAPA
    >TEM-1_beta-lactamase_urn:mavedb:00000070-a-3
    MSIQHFRVALIPFFAAFCLPVFAHPETLVKVKDAEDQLGARVGYIELDLNSGKILESFRPEERFPMMSTFKVLLCGAVLSRVDAGQEQLGRRIHYSQNDLVEYSPVTEKHLTDGMTVRELCSAAITMSDNTAANLLLTTIGGPKELTAFLHNMGDHVTRLDRWEPELNEAIPNDERDTTMPAAMATTLRKLLTGELLTLASRQQLIDWMEADKVAGPLLRSALPAGWFIADKSGAGERGSRGIIAALGPDGKPSRIVVIYTTGSQATMDERNRQIAEIGASLIKHW
    >CcdB_urn:mavedb:00000084-a
    MQFKVYTYKRESRYRLFVDVQSDIIDTPGRRMVIPLASARLLSDKVSRELYPVVHIGDESWRMMTTDMASVPVSVIGEEVADLSHRENDIKNAINLMFWGI
    >PTEN_urn:mavedb:00000054-a
    MTAIIKEIVSRNKRRYQEDGFDLDLTYIYPNIIAMGFPAERLEGVYRNNIDDVVRFLDSKHKNHYKIYNLCAERHYDTAKFNCRVAQYPFEDHNPPQLELIKPFCEDLDQWLSEDDNHVAAIHCKAGKGRTGVMICAYLLHRGKFLKAQEALDFYGEVRTRDKKGVTIPSQRRYVYYYSYLLKNHLDYRPVALLFHKMMFETIPMFSGGTCNPQFVVCQLKVKIYSSNSGPTRREDKFMYFEFPQPLPVCGDIKVEFFHKQNKMLKKDKMFHFWVNTFFIPGPEETSEKVENGSLCDQEIDSICSIERADNDKEYLVLTLTKNDLDKANKDKANRYFSPNFKVKLYFTKTVEEPSNPEASSSTSVTPDVSDNEPDHYRYSDTTDSDPENEPFDEDQHTQITKV
    >p53_urn:mavedb:00000059-a
    MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLSPDDIEQWFTEDPGPDEAPRMPEAAPPVAPAPAAPTPAAPAPAPSWPLSSSVPSQKTYQGSYGFRLGFLHSGTAKSVTCTYSPALNKMFCQLAKTCPVQLWVDSTPPPGTRVRAMAIYKQSQHMTEVVRRCPHHERCSDSDGLAPPQHLIRVEGNLRVEYLDDRNTFRHSVVVPYEPPEVGSDCTTIHYNYMCNSSCMGGMNRRPILTIITLEDSSGNLLGRNSFEVRVCACPGRDRRTEEENLRKKGEPHHELPPGSTKRALPNNTSSSPQPKKKPLDGEYFTLQIRGRERFEMFRELNEALELKDAQAGKEPGGSRAHSSHLKSKKGQSTSRHKKLMFKTEGPDSD

.. _structguy_inputs:

Inputs dedicated for further processing with StructGuy
------------------------------------------------------

| Generally, all StructMAn-readable formats can be used with StructGuy for prediction. For the featurization of a training dataset, an effect value needs to be assigned to each mutation. This can be done with the following formats:

.. _smlf_sg:

Simple Mutation List Format for StructGuy Training
--------------------------------------------------

| The effect values are assigned using the tags column with the general form: ``#effect_name:float``. The ``effect_name`` can be any plain text string, but should be consistent throughout the file. ``float`` needs to be a floating point number.

| An example:

::

    A0A140D2T1	I291A	#effect:0.0302686862600679
    A0A140D2T1	I291Y	#effect:0.048604163054757
    A0A140D2T1	I291W	#effect:0.0936416537015749
    A0A140D2T1	I291V	#effect:0.626746537538822
    A0A140D2T1	I291T	#effect:1.76206628831371
    A0A140D2T1	I291S	#effect:0.017235384036755
    A0A140D2T1	I291R	#effect:0.008824236279643
    A0A140D2T1	I291Q	#effect:0.064509419785469
    A0A140D2T1	I291P	#effect:0.006903087964492
    A0A140D2T1	I291C	#effect:0.050460709116445
    A0A140D2T1	I291M	#effect:1.55445997909363
    A0A140D2T1	I291D	#effect:0.067708005015844
    A0A140D2T1	I291E	#effect:0.051098705641169
    A0A140D2T1	I291F	#effect:0.048359762811103
    A0A140D2T1	I291N	#effect:0.047549277753243
    A0A140D2T1	I291H	#effect:0.018586107041444
    A0A140D2T1	I291K	#effect:0.021642108328451
    A0A140D2T1	I291L	#effect:0.024099011187566
    A0A140D2T1	I291G	#effect:0.023833827356767

.. _fasta_sg:

Fasta Format for StructGuy Training
-----------------------------------

| The effect values are assigned using the special ``<`` lines with the general form: ``<[AA1][POS][AA2] #effect_name:float``. ``AA1`` is the wildtype amino acid, ``POS`` is the position number of the amino acid, and ``AA2`` is the mutant amino acid. The ``effect_name`` can be any plain text string, but should be consistent throughout the file. ``float`` needs to be a floating point number.

| An example:

::

    >A0A140D2T1_ZIKV_Sourisseau_2019