7.3. Coding

pasteur policy about the best practices in software development

7.3.1. Don’t reinvent the wheel

../_images/dont_reinvent_the_wheel.jpeg

Before to write any line of code, check if there is no a program or a library which does the work or something very close. If it exists a library which does the work use it, if the library do something close propose the feature that interest you. It' avoid to waste your time. But more important if it's a living project, it's better to use a library well maintained and tested by the community instead to develop your own package.

../_images/spacer.png

7.3.2. Convention

Follow the conventions. Style guides define formatting and variable naming. This makes it easier for you to understand what you did when you have to return to an analysis years later (e.g. for paper revisions), for others (collaboration).

Use an IDE (PyCharm, VScode, sublime text, spider, ...) it will help you to produce better code (it indicate you when you do a typos, or when you do not follow the convention)

  1. do not put all your stuff in one file. split your project in several files each file keeping together function on same domain

  2. keep your functions simple. very often I see very big function which take a tons of arguments and do very different things in function of the arguments. split this in several function. It will be easier to read, test and maintain.

  3. keep files short. If a file become long (few hundred of lines). split it in several modules.

xkcd code quality

Important

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!

(extract from The Zen of Python)

7.3.3. Outputs

xkcd standards proliferation

Don’t create new standard, just for fun. Think interoperability Below the output of two multiple sequences alignment: ClustalW and ProbCons. These two software declare to produce results in CLUSTALW format.

CLUSTAL

CLUSTAL W (1.82) multiple sequence alignment

MALK_ECOLI      MASVQLQNVTKAWGEVVVSKDINLDIHEGEFVVFVGPSGCGKSTL
MALK_SALTY      MASVQLRNVTKAWGDVVVSKDINLDIHDGEFVVFVGPSGCGKSTL
MALK_ENTAE      MASVQLRNVTKAWGDVVVSKDINLEIQDGEFVVFVGPSGCGKSTL
MALK_PHOLU      MSSVTLRNVSKAYGETIISKNINLEIQEGEF--------------
                *:** *:**:**:*:.::**:***:*::***

MALK_ECOLI      LRMIAGLETITSGDLACRRLHKEPGV
MALK_SALTY      LRMIAGLETITSGDLACRRLHQEPGV
MALK_ENTAE      LRMIAGLETVTSGDL-----------
MALK_PHOLU      LRM-----------------------
                ***

PROBCONS

PROBCONS version 1.08 multiple sequence alignment

MALK_ECOLI      MASVQLQNVTKAWGEVVVSKDINLDIHEGEFVVFVGPSGCGKSTL
MALK_SALTY      MASVQLRNVTKAWGDVVVSKDINLDIHDGEFVVFVGPSGCGKSTL
MALK_ENTAE      MASVQLRNVTKAWGDVVVSKDINLEIQDGEFVVFVGPSGCGKSTL
MALK_PHOLU      MSSVTLRNVSKAYGETIISKNINLEIQEGEF--------------
                *:** *:**:**:*:.::**:***:*::***

MALK_ECOLI      LRMIAGLETITSGDLACRRLHKEPGV
MALK_SALTY      LRMIAGLETITSGDLACRRLHQEPGV
MALK_ENTAE      LRMIAGLETVTSGDL-----------
MALK_PHOLU      LRM-----------------------
                ***

The problem is that the word CLUSTAL W is a part of the format definition. So even probcons claim that it generates a clustalw output the format is not recognized as valid alignment by lot of programs.

Your results will be probably read by a human, but certainly used by an other software. Each time you create a output think how I can parse this file?

This a macsyfinder output in text format. This output is suitable for human but very hard to parse

system id = VICH001.B.00001.C001_MSH_1
model = TFF-SF_final/MSH
replicon = VICH001.B.00001.C001
clusters = [('VICH001.B.00001.C001_00406', 'MSH_mshI', 366), ('VICH001.B.00001.C001_00407', 'MSH_mshJ', 367), ('VICH001.B.00001.C001_00408', 'MSH_mshK', 368), ('VICH001.B.00001.C001_00409', '
MSH_mshL', 369), ('VICH001.B.00001.C001_00410', 'MSH_mshM', 370), ('VICH001.B.00001.C001_00411', 'MSH_mshN', 371), ('VICH001.B.00001.C001_00412', 'MSH_mshE', 372), ('VICH001.B.00001.C001_0041
3', 'MSH_mshG', 373), ('VICH001.B.00001.C001_00414', 'MSH_mshF', 374), ('VICH001.B.00001.C001_00415', 'MSH_mshB', 375), ('VICH001.B.00001.C001_00416', 'MSH_mshA', 376), ('VICH001.B.00001.C001
_00417', 'MSH_mshC', 377), ('VICH001.B.00001.C001_00418', 'MSH_mshD', 378), ('VICH001.B.00001.C001_00419', 'MSH_mshO', 379), ('VICH001.B.00001.C001_00420', 'MSH_mshP', 380), ('VICH001.B.00001
.C001_00421', 'MSH_mshQ', 381)]
occ = 1
wholeness = 0.941
loci nb = 1
score = 10.500

mandatory genes:
        - MSH_mshA: 1 (MSH_mshA)
        - MSH_mshE: 1 (MSH_mshE)
        - MSH_mshG: 1 (MSH_mshG)
        - MSH_mshL: 1 (MSH_mshL)
        - MSH_mshM: 1 (MSH_mshM)

accessory genes:
        - MSH_mshB: 1 (MSH_mshB)
        - MSH_mshC: 1 (MSH_mshC)
        - MSH_mshD: 1 (MSH_mshD)
        - MSH_mshF: 1 (MSH_mshF)
        - MSH_mshI: 1 (MSH_mshI)
        - MSH_mshI2: 0 ()
        - MSH_mshJ: 1 (MSH_mshJ)
        - MSH_mshK: 1 (MSH_mshK)
        - MSH_mshN: 1 (MSH_mshN)
        - MSH_mshO: 1 (MSH_mshO)
        - MSH_mshQ: 1 (MSH_mshQ)
        - MSH_mshP: 1 (MSH_mshP)

neutral genes:

============================================================
system id = VICH001.B.00001.C001_T4P_14
model = TFF-SF_final/T4P

hopefully there is an other output in tabulated separated value format which is easier to parse

replicon	hit_id	gene_name	hit_pos	model_fqn	sys_id	sys_loci	locus_num	sys_wholeness	sys_score	sys_occ	hit_gene_ref	hit_status	hit_seq_len	hit_i_eval	hit_score	hit_profile_cov	hit_seq_cov	hit_begin_match	hit_end_match	used_in
GCF_000005845	GCF_000005845_000970	T4P_pilC	97	TFF-SF/T4P	GCF_000005845_T4P_106	3	1	0.556	7.800	1	T4P_pilC	mandatory	400	2.2e-105	353.100	0.991	0.830	62	393	
GCF_000005845	GCF_000005845_000980	T4P_pilB	98	TFF-SF/T4P	GCF_000005845_T4P_106	3	1	0.556	7.800	1	T4P_pilB	mandatory	461	8.9e-152	506.100	0.948	0.850	62	453	
GCF_000005845	GCF_000005845_000990	T4P_pilA	99	TFF-SF/T4P	GCF_000005845_T4P_106	3	1	0.556	7.800	1	T4P_pilA	accessory	146	1.1e-19	71.200	0.859	0.473	5	73	
GCF_000005845	GCF_000005845_025680	T4P_pilW	2568	TFF-SF/T4P	GCF_000005845_T4P_106	3	2	0.556	7.800	1	T4P_pilW	accessory	187	3.3e-08	34.500	0.625	0.401	6	80	
GCF_000005845	GCF_000005845_025690	T4P_fimT	2569	TFF-SF/T4P	GCF_000005845_T4P_106	3	2	0.556	7.800	1	T4P_fimT	accessory	156	2.5e-06	28.500	0.939	0.397	5	66	
GCF_000005845	GCF_000005845_026740	T4P_pilT	2674	TFF-SF/T4P	GCF_000005845_T4P_106	3	-1	0.556	7.800	1	T4P_pilT	mandatory	326	1.1e-117	393.600	0.944	0.979	3	321	
GCF_000005845	GCF_000005845_026930	T2SS_gspO	2693	TFF-SF/T4P	GCF_000005845_T4P_106	3	-2	0.556	7.800	1	T4P_pilD	mandatory	269	1.3e-87	294.000	1.000	0.859	30	260	GCF_000005845_T2SS_83
GCF_000005845	GCF_000005845_030590	T4P_pilQ	3059	TFF-SF/T4P	GCF_000005845_T4P_106	3	3	0.556	7.800	1	T4P_pilQ	mandatory	412	5.9e-51	173.100	0.919	0.408	244	411	
GCF_000005845	GCF_000005845_030620	T4P_pilN	3062	TFF-SF/T4P	GCF_000005845_T4P_106	3	3	0.556	7.800	1	T4P_pilN	mandatory	179	3.8e-09	37.500	0.986	0.765	5	141	
GCF_000005845	GCF_000005845_030630	T4P_pilM	3063	TFF-SF/T4P	GCF_000005845_T4P_106	3	3	0.556	7.800	1	T4P_pilM	accessory	259	1.1e-09	39.300	0.988	0.598	8	162	

to parse this output

import pandas as pd
best_solutions = pd.read_csv('msf_best_solutions.tsv', comment='#', sep= '\t)

Think about the data provenance. add how this result is produced.

# macsyfinder 20210428.dev
# /home/bneron/Projects/GEM/MacSyFinder/MacSyFinder/py37/bin/macsyfinder --db-type=gembase --models-dir=tests/data//models/ --models TFF-SF Archaeal-T4P ComM MSH T2SS T4bP T4P Tad --relative-path --sequence-db tests/data/base/gembase.fasta -w 12 --previous-run tests/data/functional_test_gembase/ -o functional_test_gembase
# Systems found:
replicon	hit_id	gene_name	hit_pos	model_fqn	sys_id	sys_loci	locus_num	sys_wholeness	sys_score	sys_occ	hit_gene_ref	hit_status	hit_seq_len	hit_i_eval	hit_score	hit_profile_cov	hit_seq_cov	hit_begin_match	hit_end_match	used_in
GCF_000005845	GCF_000005845_000970	T4P_pilC	97	TFF-SF/T4P	GCF_000005845_T4P_106	3	1	0.556	7.800	1	T4P_pilC	mandatory	400	2.2e-105	353.100	0.991	0.830	62	393	
GCF_000005845	GCF_000005845_000980	T4P_pilB	98	TFF-SF/T4P	GCF_000005845_T4P_106	3	1	0.556	7.800	1	T4P_pilB	mandatory	461	8.9e-152	506.100	0.948	0.850	62	453	
GCF_000005845	GCF_000005845_000990	T4P_pilA	99	TFF-SF/T4P	GCF_000005845_T4P_106	3	1	0.556	7.800	1	T4P_pilA	accessory	146	1.1e-19	71.200	0.859	0.473	5	73	
GCF_000005845	GCF_000005845_025680	T4P_pilW	2568	TFF-SF/T4P	GCF_000005845_T4P_106	3	2	0.556	7.800	1	T4P_pilW	accessory	187	3.3e-08	34.500	0.625	0.401	6	80	
GCF_000005845	GCF_000005845_025690	T4P_fimT	2569	TFF-SF/T4P	GCF_000005845_T4P_106	3	2	0.556	7.800	1	T4P_fimT	accessory	156	2.5e-06	28.500	0.939	0.397	5	66	
GCF_000005845	GCF_000005845_026740	T4P_pilT	2674	TFF-SF/T4P	GCF_000005845_T4P_106	3	-1	0.556	7.800	1	T4P_pilT	mandatory	326	1.1e-117	393.600	0.944	0.979	3	321	
GCF_000005845	GCF_000005845_026930	T2SS_gspO	2693	TFF-SF/T4P	GCF_000005845_T4P_106	3	-2	0.556	7.800	1	T4P_pilD	mandatory	269	1.3e-87	294.000	1.000	0.859	30	260	GCF_000005845_T2SS_83
GCF_000005845	GCF_000005845_030590	T4P_pilQ	3059	TFF-SF/T4P	GCF_000005845_T4P_106	3	3	0.556	7.800	1	T4P_pilQ	mandatory	412	5.9e-51	173.100	0.919	0.408	244	411	
GCF_000005845	GCF_000005845_030620	T4P_pilN	3062	TFF-SF/T4P	GCF_000005845_T4P_106	3	3	0.556	7.800	1	T4P_pilN	mandatory	179	3.8e-09	37.500	0.986	0.765	5	141	
GCF_000005845	GCF_000005845_030630	T4P_pilM	3063	TFF-SF/T4P	GCF_000005845_T4P_106	3	3	0.556	7.800	1	T4P_pilM	accessory	259	1.1e-09	39.300	0.988	0.598	8	162	

For instance in comments at the beginning of the file. It's not a problem for the parsing. But it's easy to know how to reproduce this result.

information file about complete genome

190 255 D CDS ESCO001.C.00001.C001_00001 Valid thrL @b0001@NP_414542.1@ b0001 1 190 255 | leader; Amino acid biosynthesis: Threonine thr operon leader peptide | ..
337 2799 D CDS ESCO001.C.00001.C001_00002 Valid thrA @b0002@NP_414543.1@ b0002 1 337 2799 | enzyme; Amino acid biosynthesis: Threonine Bifunctional aspartokinase/homoserine dehydrogenase 1 | bifunctional: aspartokinase I (N-terminal); homoserine dehydrogenase I (C-terminal) ..
2801 3733 D CDS ESCO001.C.00001.C001_00003 Valid thrB @b0003@NP_414544.1@ b0003 1 2801 3733 | enzyme; Amino acid biosynthesis: Threonine homoserine kinase | ..
3734 5020 D CDS ESCO001.C.00001.C001_00004 Valid thrC @b0004@NP_414545.1@ b0004 1 3734 5020 | enzyme; Amino acid biosynthesis: Threonine L-threonine synthase | ..
5234 5530 D CDS ESCO001.C.00001.C001_00005 Valid yaaX @b0005@NP_414546.1@ b0005 1 5234 5530 | DUF2502 family putative periplasmic protein | ..
5683 6459 C CDS ESCO001.C.00001.C001_00006 Valid yaaA @b0006@NP_414547.1@ b0006 1 5683 6459 | peroxide resistance protein, lowers intracellular iron | ..
6529 7959 C CDS ESCO001.C.00001.C001_00007 Valid yaaJ @b0007@NP_414548.1@ b0007 1 6529 7959 | putative transport; Transport of small molecules: Amino acids, amines putative transporter | inner membrane transport protein ..

information file about draft genome

266 1480 C CDS  ACBA.0917.00019.b0001_00001 tyrS | Tyrosine--tRNA ligase | 6.1.1.1 | similar to AA sequence:UniProtKB:P41256
1560 2687 D CDS ACBA.0917.00019.i0001_00002 anmK | Anhydro-N-acetylmuramic acid kinase | 2.7.1.170 | similar to AA sequence:UniProtKB:Q8EHB5
2815 3660 D CDS ACBA.0917.00019.i0001_00003 ephA | Epoxide hydrolase A | 3.3.2.10 | similar to AA sequence:UniProtKB:I6YGS0
3716 4051 C CDS ACBA.0917.00019.i0001_00004 erpA | Iron-sulfur cluster insertion protein ErpA | NA | similar to AA sequence:UniProtKB:P45344
4176 5180 C CDS ACBA.0917.00019.i0001_00005 NA | hypothetical protein | NA | NA
5523 6530 C CDS ACBA.0917.00019.i0001_00006 NA | hypothetical protein | NA | NA
6697 7131 C CDS ACBA.0917.00019.i0001_00007 NA | hypothetical protein | NA | NA
7218 7787 C CDS ACBA.0917.00019.i0001_00008 NA | hypothetical protein | NA | NA

Even the format seems structured, it is not though to be used easily.

What's wrong with these format?

  1. the field separator space is also used in fields (comment)

  2. in case of complete genome there is an extra columns with a key word Valid or Pseudo, ...

so we cannot know easily which format we have, we cannot rely on the columns numbers ...

use a separator which not appear in the fields, adopt a strict syntax (same number of columns fill with None, ...)

7.3.4. Versioning

The first step to reproducibility is the versioning.
I do not talk about git but to provide to the user a way to identify the version of your software.
A user should write program --version and get a clear version identifier.

There are several schemes for software versioning:

  • based on date 2020-08-10, 20200810 , ...

  • on sequence 1.0, 1.0.1,

    • development version 1.2.dev1234 a development version of the future 1.2 release. 1234 is the current repository version.

    • pre-release

      • alpha releases: early pre-releases. A lot of changes can occur between alphas and the final release, like feature additions or refactorings. But they are minor changes and the software should stay pretty unchanged by the time the first beta is reached.

      • beta releases: at this stage, no new features are added and developers are tracking remaning bugs.

      • release candidate: a release candidate is an ultimate release before the final release. Unless something bad happens, nothing is changed.

  • ...

for further details read this guide on versioning specifications

7.3.5. Web site

If You intend to develop a web site, read the pasteur security guide And contact the DSI informatique@pasteur.fr to know the deployment rules before you start to code.