7.3. Coding¶
pasteur policy about the best practices in software development
7.3.1. Don’t reinvent the wheel¶

Before to write any line of code, check if there is no a program or a library which does the work or something very close. If it exists a library which does the work use it, if the library do something close propose the feature that interest you. It' avoid to waste your time. But more important if it's a living project, it's better to use a library well maintained and tested by the community instead to develop your own package.

7.3.2. Convention¶
Follow the conventions. Style guides define formatting and variable naming. This makes it easier for you to understand what you did when you have to return to an analysis years later (e.g. for paper revisions), for others (collaboration).
Use an IDE (PyCharm, VScode, sublime text, spider, ...) it will help you to produce better code (it indicate you when you do a typos, or when you do not follow the convention)
do not put all your stuff in one file. split your project in several files each file keeping together function on same domain
keep your functions simple. very often I see very big function which take a tons of arguments and do very different things in function of the arguments. split this in several function. It will be easier to read, test and maintain.
keep files short. If a file become long (few hundred of lines). split it in several modules.

Important
Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!
(extract from The Zen of Python)
7.3.3. Outputs¶

Don’t create new standard, just for fun. Think interoperability Below the output of two multiple sequences alignment: ClustalW and ProbCons. These two software declare to produce results in CLUSTALW format.
CLUSTAL
CLUSTAL W (1.82) multiple sequence alignment
MALK_ECOLI MASVQLQNVTKAWGEVVVSKDINLDIHEGEFVVFVGPSGCGKSTL
MALK_SALTY MASVQLRNVTKAWGDVVVSKDINLDIHDGEFVVFVGPSGCGKSTL
MALK_ENTAE MASVQLRNVTKAWGDVVVSKDINLEIQDGEFVVFVGPSGCGKSTL
MALK_PHOLU MSSVTLRNVSKAYGETIISKNINLEIQEGEF--------------
*:** *:**:**:*:.::**:***:*::***
MALK_ECOLI LRMIAGLETITSGDLACRRLHKEPGV
MALK_SALTY LRMIAGLETITSGDLACRRLHQEPGV
MALK_ENTAE LRMIAGLETVTSGDL-----------
MALK_PHOLU LRM-----------------------
***
PROBCONS
PROBCONS version 1.08 multiple sequence alignment
MALK_ECOLI MASVQLQNVTKAWGEVVVSKDINLDIHEGEFVVFVGPSGCGKSTL
MALK_SALTY MASVQLRNVTKAWGDVVVSKDINLDIHDGEFVVFVGPSGCGKSTL
MALK_ENTAE MASVQLRNVTKAWGDVVVSKDINLEIQDGEFVVFVGPSGCGKSTL
MALK_PHOLU MSSVTLRNVSKAYGETIISKNINLEIQEGEF--------------
*:** *:**:**:*:.::**:***:*::***
MALK_ECOLI LRMIAGLETITSGDLACRRLHKEPGV
MALK_SALTY LRMIAGLETITSGDLACRRLHQEPGV
MALK_ENTAE LRMIAGLETVTSGDL-----------
MALK_PHOLU LRM-----------------------
***
The problem is that the word CLUSTAL W is a part of the format definition. So even probcons claim that it generates a clustalw output the format is not recognized as valid alignment by lot of programs.
Your results will be probably read by a human, but certainly used by an other software. Each time you create a output think how I can parse this file?
This a macsyfinder output in text format. This output is suitable for human but very hard to parse
system id = VICH001.B.00001.C001_MSH_1
model = TFF-SF_final/MSH
replicon = VICH001.B.00001.C001
clusters = [('VICH001.B.00001.C001_00406', 'MSH_mshI', 366), ('VICH001.B.00001.C001_00407', 'MSH_mshJ', 367), ('VICH001.B.00001.C001_00408', 'MSH_mshK', 368), ('VICH001.B.00001.C001_00409', '
MSH_mshL', 369), ('VICH001.B.00001.C001_00410', 'MSH_mshM', 370), ('VICH001.B.00001.C001_00411', 'MSH_mshN', 371), ('VICH001.B.00001.C001_00412', 'MSH_mshE', 372), ('VICH001.B.00001.C001_0041
3', 'MSH_mshG', 373), ('VICH001.B.00001.C001_00414', 'MSH_mshF', 374), ('VICH001.B.00001.C001_00415', 'MSH_mshB', 375), ('VICH001.B.00001.C001_00416', 'MSH_mshA', 376), ('VICH001.B.00001.C001
_00417', 'MSH_mshC', 377), ('VICH001.B.00001.C001_00418', 'MSH_mshD', 378), ('VICH001.B.00001.C001_00419', 'MSH_mshO', 379), ('VICH001.B.00001.C001_00420', 'MSH_mshP', 380), ('VICH001.B.00001
.C001_00421', 'MSH_mshQ', 381)]
occ = 1
wholeness = 0.941
loci nb = 1
score = 10.500
mandatory genes:
- MSH_mshA: 1 (MSH_mshA)
- MSH_mshE: 1 (MSH_mshE)
- MSH_mshG: 1 (MSH_mshG)
- MSH_mshL: 1 (MSH_mshL)
- MSH_mshM: 1 (MSH_mshM)
accessory genes:
- MSH_mshB: 1 (MSH_mshB)
- MSH_mshC: 1 (MSH_mshC)
- MSH_mshD: 1 (MSH_mshD)
- MSH_mshF: 1 (MSH_mshF)
- MSH_mshI: 1 (MSH_mshI)
- MSH_mshI2: 0 ()
- MSH_mshJ: 1 (MSH_mshJ)
- MSH_mshK: 1 (MSH_mshK)
- MSH_mshN: 1 (MSH_mshN)
- MSH_mshO: 1 (MSH_mshO)
- MSH_mshQ: 1 (MSH_mshQ)
- MSH_mshP: 1 (MSH_mshP)
neutral genes:
============================================================
system id = VICH001.B.00001.C001_T4P_14
model = TFF-SF_final/T4P
hopefully there is an other output in tabulated separated value format which is easier to parse
replicon hit_id gene_name hit_pos model_fqn sys_id sys_loci locus_num sys_wholeness sys_score sys_occ hit_gene_ref hit_status hit_seq_len hit_i_eval hit_score hit_profile_cov hit_seq_cov hit_begin_match hit_end_match used_in
GCF_000005845 GCF_000005845_000970 T4P_pilC 97 TFF-SF/T4P GCF_000005845_T4P_106 3 1 0.556 7.800 1 T4P_pilC mandatory 400 2.2e-105 353.100 0.991 0.830 62 393
GCF_000005845 GCF_000005845_000980 T4P_pilB 98 TFF-SF/T4P GCF_000005845_T4P_106 3 1 0.556 7.800 1 T4P_pilB mandatory 461 8.9e-152 506.100 0.948 0.850 62 453
GCF_000005845 GCF_000005845_000990 T4P_pilA 99 TFF-SF/T4P GCF_000005845_T4P_106 3 1 0.556 7.800 1 T4P_pilA accessory 146 1.1e-19 71.200 0.859 0.473 5 73
GCF_000005845 GCF_000005845_025680 T4P_pilW 2568 TFF-SF/T4P GCF_000005845_T4P_106 3 2 0.556 7.800 1 T4P_pilW accessory 187 3.3e-08 34.500 0.625 0.401 6 80
GCF_000005845 GCF_000005845_025690 T4P_fimT 2569 TFF-SF/T4P GCF_000005845_T4P_106 3 2 0.556 7.800 1 T4P_fimT accessory 156 2.5e-06 28.500 0.939 0.397 5 66
GCF_000005845 GCF_000005845_026740 T4P_pilT 2674 TFF-SF/T4P GCF_000005845_T4P_106 3 -1 0.556 7.800 1 T4P_pilT mandatory 326 1.1e-117 393.600 0.944 0.979 3 321
GCF_000005845 GCF_000005845_026930 T2SS_gspO 2693 TFF-SF/T4P GCF_000005845_T4P_106 3 -2 0.556 7.800 1 T4P_pilD mandatory 269 1.3e-87 294.000 1.000 0.859 30 260 GCF_000005845_T2SS_83
GCF_000005845 GCF_000005845_030590 T4P_pilQ 3059 TFF-SF/T4P GCF_000005845_T4P_106 3 3 0.556 7.800 1 T4P_pilQ mandatory 412 5.9e-51 173.100 0.919 0.408 244 411
GCF_000005845 GCF_000005845_030620 T4P_pilN 3062 TFF-SF/T4P GCF_000005845_T4P_106 3 3 0.556 7.800 1 T4P_pilN mandatory 179 3.8e-09 37.500 0.986 0.765 5 141
GCF_000005845 GCF_000005845_030630 T4P_pilM 3063 TFF-SF/T4P GCF_000005845_T4P_106 3 3 0.556 7.800 1 T4P_pilM accessory 259 1.1e-09 39.300 0.988 0.598 8 162
to parse this output
import pandas as pd
best_solutions = pd.read_csv('msf_best_solutions.tsv', comment='#', sep= '\t)
Think about the data provenance. add how this result is produced.
# macsyfinder 20210428.dev
# /home/bneron/Projects/GEM/MacSyFinder/MacSyFinder/py37/bin/macsyfinder --db-type=gembase --models-dir=tests/data//models/ --models TFF-SF Archaeal-T4P ComM MSH T2SS T4bP T4P Tad --relative-path --sequence-db tests/data/base/gembase.fasta -w 12 --previous-run tests/data/functional_test_gembase/ -o functional_test_gembase
# Systems found:
replicon hit_id gene_name hit_pos model_fqn sys_id sys_loci locus_num sys_wholeness sys_score sys_occ hit_gene_ref hit_status hit_seq_len hit_i_eval hit_score hit_profile_cov hit_seq_cov hit_begin_match hit_end_match used_in
GCF_000005845 GCF_000005845_000970 T4P_pilC 97 TFF-SF/T4P GCF_000005845_T4P_106 3 1 0.556 7.800 1 T4P_pilC mandatory 400 2.2e-105 353.100 0.991 0.830 62 393
GCF_000005845 GCF_000005845_000980 T4P_pilB 98 TFF-SF/T4P GCF_000005845_T4P_106 3 1 0.556 7.800 1 T4P_pilB mandatory 461 8.9e-152 506.100 0.948 0.850 62 453
GCF_000005845 GCF_000005845_000990 T4P_pilA 99 TFF-SF/T4P GCF_000005845_T4P_106 3 1 0.556 7.800 1 T4P_pilA accessory 146 1.1e-19 71.200 0.859 0.473 5 73
GCF_000005845 GCF_000005845_025680 T4P_pilW 2568 TFF-SF/T4P GCF_000005845_T4P_106 3 2 0.556 7.800 1 T4P_pilW accessory 187 3.3e-08 34.500 0.625 0.401 6 80
GCF_000005845 GCF_000005845_025690 T4P_fimT 2569 TFF-SF/T4P GCF_000005845_T4P_106 3 2 0.556 7.800 1 T4P_fimT accessory 156 2.5e-06 28.500 0.939 0.397 5 66
GCF_000005845 GCF_000005845_026740 T4P_pilT 2674 TFF-SF/T4P GCF_000005845_T4P_106 3 -1 0.556 7.800 1 T4P_pilT mandatory 326 1.1e-117 393.600 0.944 0.979 3 321
GCF_000005845 GCF_000005845_026930 T2SS_gspO 2693 TFF-SF/T4P GCF_000005845_T4P_106 3 -2 0.556 7.800 1 T4P_pilD mandatory 269 1.3e-87 294.000 1.000 0.859 30 260 GCF_000005845_T2SS_83
GCF_000005845 GCF_000005845_030590 T4P_pilQ 3059 TFF-SF/T4P GCF_000005845_T4P_106 3 3 0.556 7.800 1 T4P_pilQ mandatory 412 5.9e-51 173.100 0.919 0.408 244 411
GCF_000005845 GCF_000005845_030620 T4P_pilN 3062 TFF-SF/T4P GCF_000005845_T4P_106 3 3 0.556 7.800 1 T4P_pilN mandatory 179 3.8e-09 37.500 0.986 0.765 5 141
GCF_000005845 GCF_000005845_030630 T4P_pilM 3063 TFF-SF/T4P GCF_000005845_T4P_106 3 3 0.556 7.800 1 T4P_pilM accessory 259 1.1e-09 39.300 0.988 0.598 8 162
For instance in comments at the beginning of the file. It's not a problem for the parsing. But it's easy to know how to reproduce this result.
information file about complete genome
190 255 D CDS ESCO001.C.00001.C001_00001 Valid thrL @b0001@NP_414542.1@ b0001 1 190 255 | leader; Amino acid biosynthesis: Threonine thr operon leader peptide | ..
337 2799 D CDS ESCO001.C.00001.C001_00002 Valid thrA @b0002@NP_414543.1@ b0002 1 337 2799 | enzyme; Amino acid biosynthesis: Threonine Bifunctional aspartokinase/homoserine dehydrogenase 1 | bifunctional: aspartokinase I (N-terminal); homoserine dehydrogenase I (C-terminal) ..
2801 3733 D CDS ESCO001.C.00001.C001_00003 Valid thrB @b0003@NP_414544.1@ b0003 1 2801 3733 | enzyme; Amino acid biosynthesis: Threonine homoserine kinase | ..
3734 5020 D CDS ESCO001.C.00001.C001_00004 Valid thrC @b0004@NP_414545.1@ b0004 1 3734 5020 | enzyme; Amino acid biosynthesis: Threonine L-threonine synthase | ..
5234 5530 D CDS ESCO001.C.00001.C001_00005 Valid yaaX @b0005@NP_414546.1@ b0005 1 5234 5530 | DUF2502 family putative periplasmic protein | ..
5683 6459 C CDS ESCO001.C.00001.C001_00006 Valid yaaA @b0006@NP_414547.1@ b0006 1 5683 6459 | peroxide resistance protein, lowers intracellular iron | ..
6529 7959 C CDS ESCO001.C.00001.C001_00007 Valid yaaJ @b0007@NP_414548.1@ b0007 1 6529 7959 | putative transport; Transport of small molecules: Amino acids, amines putative transporter | inner membrane transport protein ..
information file about draft genome
266 1480 C CDS ACBA.0917.00019.b0001_00001 tyrS | Tyrosine--tRNA ligase | 6.1.1.1 | similar to AA sequence:UniProtKB:P41256
1560 2687 D CDS ACBA.0917.00019.i0001_00002 anmK | Anhydro-N-acetylmuramic acid kinase | 2.7.1.170 | similar to AA sequence:UniProtKB:Q8EHB5
2815 3660 D CDS ACBA.0917.00019.i0001_00003 ephA | Epoxide hydrolase A | 3.3.2.10 | similar to AA sequence:UniProtKB:I6YGS0
3716 4051 C CDS ACBA.0917.00019.i0001_00004 erpA | Iron-sulfur cluster insertion protein ErpA | NA | similar to AA sequence:UniProtKB:P45344
4176 5180 C CDS ACBA.0917.00019.i0001_00005 NA | hypothetical protein | NA | NA
5523 6530 C CDS ACBA.0917.00019.i0001_00006 NA | hypothetical protein | NA | NA
6697 7131 C CDS ACBA.0917.00019.i0001_00007 NA | hypothetical protein | NA | NA
7218 7787 C CDS ACBA.0917.00019.i0001_00008 NA | hypothetical protein | NA | NA
Even the format seems structured, it is not though to be used easily.
What's wrong with these format?
the field separator space is also used in fields (comment)
in case of complete genome there is an extra columns with a key word Valid or Pseudo, ...
so we cannot know easily which format we have, we cannot rely on the columns numbers ...
use a separator which not appear in the fields, adopt a strict syntax (same number of columns fill with None, ...)
7.3.4. Versioning¶
There are several schemes for software versioning:
based on date 2020-08-10, 20200810 , ...
on sequence 1.0, 1.0.1,
development version 1.2.dev1234 a development version of the future 1.2 release. 1234 is the current repository version.
pre-release
alpha releases: early pre-releases. A lot of changes can occur between alphas and the final release, like feature additions or refactorings. But they are minor changes and the software should stay pretty unchanged by the time the first beta is reached.
beta releases: at this stage, no new features are added and developers are tracking remaning bugs.
release candidate: a release candidate is an ultimate release before the final release. Unless something bad happens, nothing is changed.
...
for further details read this guide on versioning specifications
7.3.5. Web site¶
If You intend to develop a web site, read the pasteur security guide And contact the DSI informatique@pasteur.fr to know the deployment rules before you start to code.