7.4. Tests

  • Humans make mistakes.

  • Thus, almost any analysis code will include mistakes.

  • This includes Unix, R, Perl/Python/Ruby/Node, ...

  • To increase the robustness of our analyses, we must become better at detecting these mistakes.

  • Every subset, function, and method of every piece of code should be exercised on test data (including edge cases) to ensure that the results are as expected.

  • Two complementary approaches are used for this: functional and unit testing.


Unit tests: Unit tests are written from a programmer's perspective. They ensure that a particular method of a class successfully performs a set of specific tasks. Each test confirms that a method produces the expected output when given a known input.

Functional tests: Functional tests are written from a user's perspective. These tests confirm that the system does what users are expecting it to.

Many times the development of a system is likened to the building of a house. While this analogy isn't quite correct, we can extend it for the purposes of understanding the difference between unit and functional tests. Unit testing is analogous to a building inspector visiting a house's construction site. He is focused on the various internal systems of the house: the foundation, framing, electrical, plumbing, and so on. He ensures (tests) that the parts of the house will work correctly and safely, that is, meet the building code. Functional tests in this scenario are analogous to the homeowner visiting this same construction site. He assumes that the internal systems will behave appropriately, that the building inspector is performing his task. The homeowner is focused on what it will be like to live in this house. He is concerned with how the house looks: are the various rooms a comfortable size? Does the house fit the family's needs? Are the windows in a good spot to catch the morning sun? The homeowner is performing functional tests on the house. He has the user's perspective. The building inspector is performing unit tests on the house. He has the builder's perspective.

Because both types of tests are necessary, you'll need guidelines for writing them.

Writing a suite of maintainable, automated tests without a testing framework is virtually impossible. So choose a testing framework. (https://www.softwaretestingtricks.com/2007/01/unit-testing-versus-functional-tests.html)
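
To make the distinction concrete, here is a minimal, self-contained sketch in Python. The analysed code (gc_content and run_analysis) is invented for illustration; the two test functions show the two perspectives and can be collected by any of the frameworks presented below.

import os
import tempfile


def gc_content(sequence):
    """Return the GC fraction of a DNA sequence (0.0 for an empty sequence)."""
    if not sequence:
        return 0.0
    return sum(1 for base in sequence.upper() if base in "GC") / len(sequence)


def run_analysis(fasta_path, out_path):
    """Toy 'whole analysis': read one sequence, write its GC content to a report."""
    with open(fasta_path) as infile:
        sequence = "".join(line.strip() for line in infile if not line.startswith(">"))
    with open(out_path, "w") as outfile:
        outfile.write(f"gc\t{gc_content(sequence):.2f}\n")


def test_gc_content():
    # unit test: one function, known inputs, expected outputs (including an edge case)
    assert gc_content("ATGC") == 0.5
    assert gc_content("") == 0.0


def test_run_analysis():
    # functional test: run the whole analysis as a user would and check the report
    with tempfile.TemporaryDirectory() as tmp:
        fasta = os.path.join(tmp, "input.fasta")
        report = os.path.join(tmp, "report.tsv")
        with open(fasta, "w") as f:
            f.write(">seq1\nATGC\n")
        run_analysis(fasta, report)
        with open(report) as f:
            assert f.read() == "gc\t0.50\n"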

7.4.1. Invisible mistakes can be costly

Crucially, data analysis problems can be invisible: the analysis runs, the results seem biologically meaningful and are wonderfully interpretable, but they may in fact be completely wrong. Geoffrey Chang's story is an emblematic example. By the mid-2000s he was a young superstar crystallography professor, having won prestigious awards and published high-profile papers providing 3D structures of important proteins. For example:

  • Science (2001) Chang & Roth. Structure of MsbA from E. coli: a homolog of the multidrug resistance ATP binding cassette (ABC) transporters.

  • Journal of Molecular Biology (2003) Chang. Structure of MsbA from Vibrio cholera: a multidrug resistance ABC transporter homolog in a closed conformation.

  • Science (2005) Reyes & Chang. Structure of the ABC transporter MsbA in complex with ADP vanadate and lipopolysaccharide.

  • Science (2005) Pornillos et al. X-ray structure of the EmrE multidrug transporter in complex with a substrate. 310:1950-1953.

  • PNAS (2004) Ma & Chang Structure of the multidrug resistance efflux transporter EmrE from E. coli.

But in 2006, others independently obtained the 3D structure of an ortholog to one of those proteins. Surprisingly, the orthologous structure was essentially a mirror-image of Geoffrey Chang's result. After rigorously double-checking his scripts, Geoffrey Chang realized that:

  • "an in-house data reduction program introduced a change in sign [..,]".*

In other words, a simple +/- sign error led to plausible and highly publishable, but dramatically flawed, results.

He retracted all five papers.

This was devastating for Geoffrey Chang, for his career, for the people working with him, for the hundreds of scientists who based follow-up analyses and experiments on the flawed 3D structures, and for the taxpayers or foundations funding the research. A small but costly mistake. (https://software.ac.uk/blog/2016-09-26-how-avoid-having-retract-your-genomics-analysis)

7.4.2. Unit tests

Unit testing means testing individual modules of an application in isolation (without any interaction with dependencies) to confirm that the code is doing things right.

Unit tests should be written at the same time as the code.

In Python there are many testing frameworks; unittest (from the standard library) and pytest are both illustrated below.

In Java, JUnit is very popular.

Each language has its own framework.
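
Before diving into a real-world example, here is a minimal warm-up sketch with unittest from the standard library (the divide function is invented for illustration):

import unittest


def divide(a, b):
    """Toy function under test."""
    if b == 0:
        raise ValueError("division by zero")
    return a / b


class TestDivide(unittest.TestCase):

    def setUp(self):
        # run before each test method to build the fixtures
        self.numerator = 42

    def test_divide(self):
        # test methods must be named test_*; the assert* methods compare results
        self.assertEqual(divide(self.numerator, 2), 21)

    def test_divide_by_zero(self):
        # check that the expected exception is raised
        with self.assertRaises(ValueError):
            divide(self.numerator, 0)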

7.4.2.1. Python example with unittest

7.4.2.1.1. Basic example

The code to test

  1#########################################################################
  2# MacSyFinder - Detection of macromolecular systems in protein dataset  #
  3#               using systems modelling and similarity search.          #
  4# Authors: Sophie Abby, Bertrand Neron                                  #
  5# Copyright (c) 2014-2021  Institut Pasteur (Paris) and CNRS.           #
  6# See the COPYRIGHT file for details                                    #
  7#                                                                       #
  8# This file is part of MacSyFinder package.                             #
  9#                                                                       #
 10# MacSyFinder is free software: you can redistribute it and/or modify   #
 11# it under the terms of the GNU General Public License as published by  #
 12# the Free Software Foundation, either version 3 of the License, or     #
 13# (at your option) any later version.                                   #
 14#                                                                       #
 15# MacSyFinder is distributed in the hope that it will be useful,        #
 16# but WITHOUT ANY WARRANTY; without even the implied warranty of        #
 17# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the          #
 18# GNU General Public License for more details .                         #
 19#                                                                       #
 20# You should have received a copy of the GNU General Public License     #
 21# along with MacSyFinder (COPYING).                                     #
 22# If not, see <https://www.gnu.org/licenses/>.                          #
 23#########################################################################
 24
 25
 26import logging
 27_log = logging.getLogger(__name__)
 28from enum import Enum
 29
 30from .error import MacsypyError
 31
 32
 33class GeneBank:
 34    """
 35    Store all Gene objects. Ensure that genes are instanciated only once.
 36    """
 37
 38    def __init__(self):
 39        self._genes_bank = {}
 40
 41    def __getitem__(self, key):
 42        """
 43        :param key: The key to retrieve a gene.
 44                    The key is composed of the name of models family and the gene name.
 45                    for instance CRISPR-Cas/cas9_TypeIIB ('CRISPR-Cas' , 'cas9_TypeIIB') or
 46                    TXSS/T6SS_tssH ('TXSS', 'T6SS_tssH')
 47        :type key: tuple (string, string)
 48        :return: return the Gene corresponding to the key.
 49        :rtype: :class:`macsypy.gene.CoreGene` object
 50        :raise KeyError: if the key does not exist in GeneBank.
 51        """
 52        try:
 53            return self._genes_bank[key]
 54        except KeyError:
 55            raise KeyError(f"No such gene '{key}' in this bank")
 56
 57
 58    def __len__(self):
 59        return len(self._genes_bank)
 60
 61
 62    def __contains__(self, gene):
 63        """
 64        Implement the membership test operator
 65
 66        :param gene: the gene to test
 67        :type gene: :class:`macsypy.gene.CoreGene` object
 68        :return: True if the gene is in, False otherwise
 69        :rtype: boolean
 70        """
 71        return gene in set(self._genes_bank.values())
 72
 73
 74    def __iter__(self):
 75        """
 76        Return an iterator object on the genes contained in the bank
 77        """
 78        return iter(self._genes_bank.values())
 79
 80
 81    def genes_fqn(self):
 82        """
 83        :return: the fully qualified name for all genes in the bank
 84        :rtype: str
 85        """
 86        return [f"{fam}/{gen_nam}" for fam, gen_nam in self._genes_bank.keys()]
 87
 88
 89    def add_new_gene(self, model_location, name, profile_factory):
 90        """
 91        Create a gene and store it in the bank. If the same gene (same name) is add twice,
 92        it is created only the first time.
 93
 94        :param model_location: the location where the model family can be found.
 95        :type model_location: :class:`macsypy.registry.ModelLocation` object
 96        :param name: the name of the gene to add
 97        :type name: str
 98        :param profile_factory: The Profile factory
 99        :type profile_factory: :class:`profile.ProfileFactory` object.
100        """
101        key = (model_location.name, name)
102        if key not in self._genes_bank:
103            gene = CoreGene(model_location, name, profile_factory)
104            self._genes_bank[key] = gene
105

The unit test with the unittest framework

 1
 2class Test(MacsyTest):
 3
 4    def setUp(self):
 5        args = argparse.Namespace()
 6        args.sequence_db = self.find_data("base", "test_1.fasta")
 7        args.db_type = 'gembase'
 8        args.models_dir = self.find_data('models')
 9        args.res_search_dir = tempfile.gettempdir()
10        args.log_level = 30
11        self.cfg = Config(MacsyDefaults(), args)
12
13        self.model_name = 'foo'
14        self.model_location = ModelLocation(path=os.path.join(args.models_dir, self.model_name))
15        self.gene_bank = GeneBank()
16        self.profile_factory = ProfileFactory(self.cfg)
17
18    def tearDown(self):
19        try:
20            shutil.rmtree(self.cfg.working_dir)
21        except:
22            pass
23
24
25    def test_add_get_gene(self):
26        gene_name = 'sctJ_FLG'
27        with self.assertRaises(KeyError) as ctx:
28            self.gene_bank[f"foo/{gene_name}"]
29        self.assertEqual(str(ctx.exception),
30                         f"\"No such gene 'foo/{gene_name}' in this bank\"")
31        model_foo = Model(self.model_name, 10)
32
33        self.gene_bank.add_new_gene(self.model_location, gene_name, self.profile_factory)
34
35        gene_from_bank = self.gene_bank[(model_foo.family_name, gene_name)]
36        self.assertTrue(isinstance(gene_from_bank, CoreGene))
37        self.assertEqual(gene_from_bank.name, gene_name)
38        gbk_contains_before = list(self.gene_bank)
39        self.gene_bank.add_new_gene(self.model_location, gene_name, self.profile_factory)
40        gbk_contains_after = list(self.gene_bank)
41        self.assertEqual(gbk_contains_before, gbk_contains_after)
42
43        gene_name = "bar"
44        with self.assertRaises(MacsypyError) as ctx:
45            self.gene_bank.add_new_gene(self.model_location, gene_name, self.profile_factory)
46        self.assertEqual(str(ctx.exception),
47                         f"'{self.model_name}/{gene_name}': No such profile")
48
49
50    def test_contains(self):
51        model_foo = Model("foo/bar", 10)
52        gene_name = 'sctJ_FLG'
53
54        self.gene_bank.add_new_gene(self.model_location, gene_name, self.profile_factory)
55        gene_in = self.gene_bank[(model_foo.family_name, gene_name)]
56        self.assertIn(gene_in, self.gene_bank)
57
58        gene_name = 'abc'
59        c_gene_out = CoreGene(self.model_location, gene_name, self.profile_factory)
60        gene_out = ModelGene(c_gene_out, model_foo)
61        self.assertNotIn(gene_out, self.gene_bank)
62
63
64    def test_iter(self):
65        genes_names = ['sctJ_FLG', 'abc']
66        for g in genes_names:
67            self.gene_bank.add_new_gene(self.model_location, g, self.profile_factory)
68        self.assertListEqual([g.name for g in self.gene_bank],
69                             genes_names)
70
71    def test_genes_fqn(self):
72        genes_names = ['sctJ_FLG', 'abc']
73        for g in genes_names:
74            self.gene_bank.add_new_gene(self.model_location, g, self.profile_factory)
75        self.assertSetEqual(set(self.gene_bank.genes_fqn()),
76                             {f"{self.model_location.name}/{g.name}" for g in self.gene_bank})
77
78
79    def test_get_uniq_object(self):
80        gene_name = 'sctJ_FLG'
81        self.gene_bank.add_new_gene(self.model_location, gene_name, self.profile_factory)
82        self.gene_bank.add_new_gene(self.model_location, gene_name, self.profile_factory)
83        self.assertEqual(len(self.gene_bank), 1)

7.4.2.1.2. A little more complex example

The code to test

 1
 2import itertools
 3import networkx as nx
 4
 5
 6def find_best_solutions(systems):
 7    """
 8    Among the systems choose the combination of systems which does not share :class:`macsypy.hit.Hit`
 9    and maximize the sum of systems scores
10
11    :param systems: the systems to analyse
12    :type systems: list of :class:`macsypy.system.System` object
13    :return: the list of lists of systems which represent one best solution and its score
14    :rtype: tuple of 2 elements: the best solution and its score
15            ([[:class:`macsypy.system.System`, ...], [:class:`macsypy.system.System`, ...]], float score)
16            The inner list represent a best solution
17    """
18    def sort_cliques(clique):
19        """
20        sort cliques
21
22         - first by the sum of hits of systems composing the solution, most hits in first
23         - second by the number of systems, most system in first
24         - third by the average of wholeness of the systems
25         - and finally by hits position. This criteria is to produce predictable results
26           between two runs and to be testable (functional_test gembase)
27
28        :param clique: the solutions to sort
29        :type clique: List of :class:`macsypy.system.System` objects
30        :return: the clique ordered
31        """
32        l = []
33        for solution in clique:
34            hits_pos = {hit.position for syst in solution for hit in syst.hits}
35            hits_pos = sorted(list(hits_pos))
36            l.append((sorted(solution, key=lambda sys: sys.id), hits_pos))
37
38        sorted_cliques = sorted(l, key=lambda item: (sum([len(sys.hits) for sys in item[0]]),
39                                                     len(item[0]),
40                                                     item[1],
41                                                     sum([sys.wholeness for sys in item[0]]) / len(item[0]),
42                                                     '_'.join([sys.id for sys in item[0]])
43                                                     ),
44                                reverse=True)
45        sorted_cliques = [item[0] for item in sorted_cliques]
46        return sorted_cliques
47
48    G = nx.Graph()
49    # add nodes (vertices)
50    G.add_nodes_from(systems)
51    # let's create an edges between compatible nodes
52    for sys_i, sys_j in itertools.combinations(systems, 2):
53        if sys_i.is_compatible(sys_j):
54            G.add_edge(sys_i, sys_j)
55
56    cliques = nx.algorithms.clique.find_cliques(G)
57    max_score = None
58    max_cliques = []
59    for c in cliques:
60        current_score = sum([s.score for s in c])
61        if max_score is None or (current_score > max_score):
62            max_score = current_score
63            max_cliques = [c]
64        elif current_score == max_score:
65            max_cliques.append(c)
66    # sort the solutions (cliques)
67    solutions = sort_cliques(max_cliques)
68    return solutions, max_score
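
The function above delegates the combinatorial work to networkx: the systems become the nodes of a graph, compatible systems are linked by edges, and nx.algorithms.clique.find_cliques() (also exposed as nx.find_cliques()) enumerates its maximal cliques. A standalone toy illustration of that call, assuming networkx is installed:

import networkx as nx

G = nx.Graph()
# a triangle A-B-C plus an extra edge C-D
G.add_edges_from([("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")])
# find_cliques() yields the maximal cliques, i.e. the largest groups of nodes
# that are all pairwise connected (here: pairwise "compatible" systems)
cliques = sorted(sorted(clique) for clique in nx.find_cliques(G))
print(cliques)   # [['A', 'B', 'C'], ['C', 'D']]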

The unit test

 1
 2def _build_systems(cfg, profile_factory):
 3    model_name = 'foo'
 4    model_location = ModelLocation(path=os.path.join(cfg.models_dir()[0], model_name))
 5    model_A = Model("foo/A", 10)
 6    model_B = Model("foo/B", 10)
 7    model_C = Model("foo/C", 10)
 8    model_D = Model("foo/D", 10)
 9    model_E = Model("foo/E", 10)
10    model_F = Model("foo/F", 10)
11    model_G = Model("foo/G", 10)
12    model_H = Model("foo/H", 10)
13    model_I = Model("foo/I", 10)
14    model_J = Model("foo/J", 10)
15    model_K = Model("foo/K", 10)
16
17    c_gene_sctn_flg = CoreGene(model_location, "sctN_FLG", profile_factory)
18    gene_sctn_flg = ModelGene(c_gene_sctn_flg, model_B)
19    c_gene_sctj_flg = CoreGene(model_location, "sctJ_FLG", profile_factory)
20    gene_sctj_flg = ModelGene(c_gene_sctj_flg, model_B)
21    c_gene_flgB = CoreGene(model_location, "flgB", profile_factory)
22    gene_flgB = ModelGene(c_gene_flgB, model_B)
23    c_gene_tadZ = CoreGene(model_location, "tadZ", profile_factory)
24    gene_tadZ = ModelGene(c_gene_tadZ, model_B)
25    systems = {}
26
27    systems['A'] = System(model_A, [c1, c2], cfg.redundancy_penalty())  # 5 hits
28    # we need to tweak the replicon_id to have stable results
29    # whatever the number of tests run
30    # or their order
31    systems['A'].id = "replicon_id_A"
32    systems['B'] = System(model_B, [c3], cfg.redundancy_penalty())  # 3 hits
33    systems['B'].id = "replicon_id_B"
34    systems['C'] = System(model_C, [c4], cfg.redundancy_penalty())  # 4 hits
35    systems['C'].id = "replicon_id_C"
36    systems['D'] = System(model_D, [c5], cfg.redundancy_penalty())  # 2 hits
37    systems['D'].id = "replicon_id_D"
38    systems['E'] = System(model_E, [c6], cfg.redundancy_penalty())  # 1 hit
39    systems['E'].id = "replicon_id_E"
40    systems['F'] = System(model_F, [c7], cfg.redundancy_penalty())  # 1 hit
41    systems['F'].id = "replicon_id_F"
42    systems['G'] = System(model_G, [c4], cfg.redundancy_penalty())  # 4 hits
43    systems['G'].id = "replicon_id_G"
44    systems['H'] = System(model_H, [c5], cfg.redundancy_penalty())  # 2 hits
45    systems['H'].id = "replicon_id_H"
46    systems['I'] = System(model_I, [c8], cfg.redundancy_penalty())  # 2 hits
47    systems['I'].id = "replicon_id_I"
48    systems['J'] = System(model_J, [c9], cfg.redundancy_penalty())  # 2 hits
49    systems['J'].id = "replicon_id_J"
50    systems['K'] = System(model_K, [c10], cfg.redundancy_penalty())  # 2 hits
51    systems['K'].id = "replicon_id_K"
52
53    return systems
54
55    def test_find_best_solution(self):
56        systems = [self.systems[k] for k in 'ABCD']
57        sorted_syst = sorted(systems, key=lambda s: (- s.score, s.id))
58        # sorted_syst = [('replicon_id_C', 3.0), ('replicon_id_B', 2.0), ('replicon_id_A', 1.5), ('replicon_id_D', 1.5)]
59        # replicon_id_C ['hit_sctj_flg', 'hit_tadZ', 'hit_flgB', 'hit_gspd']
60        # replicon_id_B ['hit_sctj_flg', 'hit_tadZ', 'hit_flgB']
61        # replicon_id_A ['hit_sctj', 'hit_sctn', 'hit_gspd', 'hit_sctj', 'hit_sctn']
62        # replicon_id_D ['hit_abc', 'hit_sctn']
63        # C and D are compatible 4.5
64        # B and A are compatible 3.5
65        # B and D are compatible 3.5
66        # So the best Solution expected is C D 4.5
67        best_sol, score = find_best_solutions(sorted_syst)
68        expected_sol = [[self.systems[k] for k in 'CD']]
69        # The order of solutions are not relevant
70        # The order of systems in each solutions are not relevant
71        # transform list in set to compare them
72        best_sol = {frozenset(sol) for sol in best_sol}
73        expected_sol = {frozenset(sol) for sol in expected_sol}
74        self.assertEqual(score, 4.5)
75        self.assertSetEqual(best_sol, expected_sol)
76
77        systems = [self.systems[k] for k in 'ABC']
78        sorted_syst = sorted(systems, key=lambda s: (- s.score, s.id))
79        # sorted_syst = [('replicon_id_C', 3.0), ('replicon_id_B', 2.0), ('replicon_id_A', 1.5)]
80        # replicon_id_C ['hit_sctj_flg', 'hit_tadZ', 'hit_flgB', 'hit_gspd']
81        # replicon_id_B ['hit_sctj_flg', 'hit_tadZ', 'hit_flgB']
82        # replicon_id_A ['hit_sctj', 'hit_sctn', 'hit_gspd', 'hit_sctj', 'hit_sctn']
83        # C is alone 3.0
84        # B and A are compatible 3.5
85        # So the best Solution expected is B and A
86        best_sol, score = find_best_solutions(sorted_syst)
87        expected_sol = [[self.systems[k] for k in 'BA']]
88        best_sol = {frozenset(sol) for sol in best_sol}
89        expected_sol = {frozenset(sol) for sol in expected_sol}
90        self.assertEqual(score, 3.5)
91        self.assertSetEqual(best_sol, expected_sol)
92

7.4.2.1.3. Example with mock

In unit tests we test each function or method in isolation from the rest of the code. But sometimes a function needs to connect to a remote server (for instance a database), call an external program, ...

In this case we simulate this external resource: this is what we call a mock.

In the example below, the tested function _url_json takes as argument the URL of a GitHub repository (following the GitHub REST API), parses the JSON response and transforms it into Python data structures.

Of course, we do not connect to GitHub and send lots of requests to the server. Instead, we simulate the different GitHub responses depending on the URL given as argument: we replace urllib.request.urlopen on the fly with our mock mocked_requests_get.
 1    def mocked_requests_get(url):
 2        class MockResponse:
 3            def __init__(self, data, status_code):
 4                self.data = io.BytesIO(bytes(data.encode("utf-8")))
 5                self.status_code = status_code
 6
 7            def read(self, length=-1):
 8                return self.data.read(length)
 9
10            def __enter__(self):
11                return self
12
13            def __exit__(self, type, value, traceback):
14                return False
15
16        if url == 'https://test_url_json/':
17            resp = {'fake': ['json', 'response']}
18            return MockResponse(json.dumps(resp), 200)
19        elif url == 'https://test_url_json/limit':
20            raise urllib.error.HTTPError(url, 403, 'forbidden', None, None)
21        elif url == 'https://api.github.com/orgs/remote_exists_true':
22            resp = {'type': 'Organization'}
23            return MockResponse(json.dumps(resp), 200)
24        elif url == 'https://api.github.com/orgs/remote_exists_false':
25            raise urllib.error.HTTPError(url, 404, 'not found', None, None)
26        elif url == 'https://api.github.com/orgs/remote_exists_server_error':
27            raise urllib.error.HTTPError(url, 500, 'Server Error', None, None)
28        elif url == 'https://api.github.com/orgs/remote_exists_unexpected_error':
29            raise urllib.error.HTTPError(url, 204, 'No Content', None, None)
30        elif url == 'https://api.github.com/orgs/list_packages/repos':
31            resp = [{'name': 'model_1'}, {'name': 'model_2'}]
32            return MockResponse(json.dumps(resp), 200)
33        elif url == 'https://api.github.com/repos/list_package_vers/model_1/tags':
34            resp = [{'name': 'v_1'}, {'name': 'v_2'}]
35            return MockResponse(json.dumps(resp), 200)
36        elif url == 'https://api.github.com/repos/list_package_vers/model_2/tags':
37            raise urllib.error.HTTPError(url, 404, 'not found', None, None)
38        elif url == 'https://api.github.com/repos/list_package_vers/model_3/tags':
39            raise urllib.error.HTTPError(url, 500, 'Server Error', None, None)
40        elif 'https://api.github.com/repos/package_download/fake/tarball/1.0' in url:
41            return MockResponse('fake data ' * 2, 200)
42        elif url == 'https://api.github.com/repos/package_download/bad_pack/tarball/0.2':
43            raise urllib.error.HTTPError(url, 404, 'not found', None, None)
44        elif url == 'https://raw.githubusercontent.com/get_metadata/foo/0.0/metadata.yml':
45            data = yaml.dump({"maintainer": {"name": "moi"}})
46            return MockResponse(data, 200)
47        else:
48            raise RuntimeError("unexpected test url", url)
49
50    @patch('urllib.request.urlopen', side_effect=mocked_requests_get)
51    def test_url_json(self, mock_urlopen):
52        rem_exists = package.RemoteModelIndex.remote_exists
53        package.RemoteModelIndex.remote_exists = lambda x: True
54        remote = package.RemoteModelIndex(org="nimportnaoik")
55        remote.cache = self.tmpdir
56        try:
57            j = remote._url_json("https://test_url_json/")
58            self.assertDictEqual(j, {'fake': ['json', 'response']})
59        finally:
60            package.RemoteModelIndex.remote_exists = rem_exists
61
62
63    @patch('urllib.request.urlopen', side_effect=mocked_requests_get)
64    def test_url_json_reach_limit(self, mock_urlopen):
65        rem_exists = package.RemoteModelIndex.remote_exists
66        package.RemoteModelIndex.remote_exists = lambda x: True
67        remote = package.RemoteModelIndex(org="nimportnaoik")
68        remote.cache = self.tmpdir
69        try:
70            with self.assertRaises(MacsyDataLimitError) as ctx:
71                remote._url_json("https://test_url_json/limit")
72            self.assertEqual(str(ctx.exception),
73                             """You reach the maximum number of request per hour to github.
74Please wait before to try again.""")
75        finally:
76            package.RemoteModelIndex.remote_exists = rem_exists
77
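
In the tests above, RemoteModelIndex.remote_exists is replaced by hand and restored in a try/finally block. A possible alternative (a sketch only, not the project's actual code) is to let unittest.mock do the bookkeeping: patch.object() swaps the attribute for the duration of the with block and restores the original automatically. Assuming the same test class and helpers as above:

from unittest.mock import patch

    # drop-in replacement for the test_url_json method of the same test class
    @patch('urllib.request.urlopen', side_effect=mocked_requests_get)
    def test_url_json(self, mock_urlopen):
        # remote_exists is temporarily replaced by a mock that returns True,
        # and is restored automatically when the with block exits
        with patch.object(package.RemoteModelIndex, 'remote_exists', return_value=True):
            remote = package.RemoteModelIndex(org="nimportnaoik")
            remote.cache = self.tmpdir
            j = remote._url_json("https://test_url_json/")
            self.assertDictEqual(j, {'fake': ['json', 'response']})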

7.4.2.1.4. Running the tests

When the tests pass:

unittest output

When a test fails:

unittest failed output
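
How the tests are launched depends on the project layout. A minimal sketch with the standard runner (the tests/ directory and module names are only examples):

import unittest

# ... TestCase classes such as the ones shown above ...

if __name__ == '__main__':
    # allows running a test module directly: python tests/test_gene_bank.py
    unittest.main()

# the whole suite can also be collected automatically:
#   python -m unittest discover -s tests -v
# the runner prints one '.' per passing test, 'F' for a failure and 'E' for an
# error, followed by a summary such as "OK" or "FAILED (failures=1)"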

7.4.2.2. Python example with pytest

The code to test

 1
 2def classic_levenshtein(string_1, string_2):
 3    """
 4    Calculates the Levenshtein distance between two strings.
 5
 6    This version is easier to read, but significantly slower than the version
 7    below (up to several orders of magnitude). Useful for learning, less so
 8    otherwise.
 9
10    Usage::
11
12        >>> classic_levenshtein('kitten', 'sitting')
13        3
14        >>> classic_levenshtein('kitten', 'kitten')
15        0
16        >>> classic_levenshtein('', '')
17        0
18
19    """
20    len_1 = len(string_1)
21    len_2 = len(string_2)
22    cost = 0
23
24    if len_1 and len_2 and string_1[0] != string_2[0]:
25        cost = 1
26
27    if len_1 == 0:
28        return len_2
29    elif len_2 == 0:
30        return len_1
31    else:
32        return min(
33            classic_levenshtein(string_1[1:], string_2) + 1,
34            classic_levenshtein(string_1, string_2[1:]) + 1,
35            classic_levenshtein(string_1[1:], string_2[1:]) + cost,
36        )
37
38
39
40def wf_levenshtein(string_1, string_2):
41    """
42    Calculates the Levenshtein distance between two strings.
43
44    This version uses the Wagner-Fischer algorithm.
45
46    Usage::
47
48        >>> wf_levenshtein('kitten', 'sitting')
49        3
50        >>> wf_levenshtein('kitten', 'kitten')
51        0
52        >>> wf_levenshtein('', '')
53        0
54
55    """
56    len_1 = len(string_1) + 1
57    len_2 = len(string_2) + 1
58
59    d = [0] * (len_1 * len_2)
60
61    for i in range(len_1):
62        d[i] = i
63    for j in range(len_2):
64        d[j * len_1] = j
65
66    for j in range(1, len_2):
67        for i in range(1, len_1):
68            if string_1[i - 1] == string_2[j - 1]:
69                d[i + j * len_1] = d[i - 1 + (j - 1) * len_1]
70            else:
71                d[i + j * len_1] = min(
72                   d[i - 1 + j * len_1] + 1,        # deletion
73                   d[i + (j - 1) * len_1] + 1,      # insertion
74                   d[i - 1 + (j - 1) * len_1] + 1,  # substitution
75                )
76
77    return d[-1]
78
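
Note that the docstrings above already contain usage examples. The standard doctest module can execute them as tests, which is a cheap complement to a real test suite; a minimal sketch, assuming the module is importable as bioconvert.core.levenshtein (as in the test file below):

import doctest

from bioconvert.core import levenshtein

# run every '>>>' example found in the module's docstrings
results = doctest.testmod(levenshtein, verbose=False)
print(results)   # e.g. TestResults(failed=0, attempted=6)

# pytest can also collect these doctests directly:
#   pytest --doctest-modules bioconvert/core/levenshtein.py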

The test with the pytest framework

 1from bioconvert.core.levenshtein import wf_levenshtein, classic_levenshtein
 2
 3
 4def test_wf_levenshtein():
 5    levenshtein_tests(wf_levenshtein)
 6
 7
 8def test_classic_levenshtein():
 9    levenshtein_tests(classic_levenshtein)
10
11
12def levenshtein_tests(levenshtein):
13    assert 1 == levenshtein('kitten', 'kittenn')
14    assert 3 == levenshtein('kitten', 'sitting')
15    assert 0 == levenshtein('kitten', 'kitten')
16    assert 0 == levenshtein('', '')
17    assert 2 == levenshtein('sitting', 'sititng')
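
With pytest, the shared helper above can also be written with the pytest.mark.parametrize decorator, which generates one independent test per implementation and per input/expected pair, so each failure is reported separately. A possible rewrite of the same checks:

import pytest

from bioconvert.core.levenshtein import wf_levenshtein, classic_levenshtein


@pytest.mark.parametrize("levenshtein", [classic_levenshtein, wf_levenshtein])
@pytest.mark.parametrize("string_1, string_2, expected", [
    ('kitten', 'kittenn', 1),
    ('kitten', 'sitting', 3),
    ('kitten', 'kitten', 0),
    ('', '', 0),
    ('sitting', 'sititng', 2),
])
def test_levenshtein(levenshtein, string_1, string_2, expected):
    assert levenshtein(string_1, string_2) == expected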

7.4.3. Functional tests

It is also easy to write functional tests with unittest (depending on how you code your entry point):

 1
 2    @unittest.skipIf(not which('hmmsearch'), 'hmmsearch not found in PATH')
 3    def test_gembase(self):
 4        """
 5
 6        """
 7        expected_result_dir = self.find_data("functional_test_gembase")
 8        args = "--db-type=gembase " \
 9               f"--models-dir={self.find_data('models')} " \
10               "--models TFF-SF all " \
11               "--out-dir={out_dir} " \
12               "--index-dir {out_dir} " \
13               f"--previous-run {expected_result_dir} " \
14               "--relative-path"
15
16        self._macsyfinder_run(args)
17        for file_name in (self.all_systems_tsv,
18                          self.all_best_solutions,
19                          self.best_solution,
20                          self.summary):
21            with self.subTest(file_name=file_name):
22                expected_result = self.find_data(expected_result_dir, file_name)
23                get_results = os.path.join(self.out_dir, file_name)
24                self.assertTsvEqual(expected_result, get_results, comment="#", tsv_type=file_name)
25        expected_result = self.find_data(expected_result_dir, self.rejected_clusters)
26        get_results = os.path.join(self.out_dir, self.rejected_clusters)
27        self.assertFileEqual(expected_result, get_results, comment="#")
28
29
30    def test_only_loners(self):
31        expected_result_dir = self.find_data("functional_test_only_loners")
32        args = "--db-type ordered_replicon " \
33               "--replicon-topology linear  " \
34               f"--models-dir {self.find_data('models')} " \
35               "-m test_loners MOB_cf_T5SS " \
36               "-o {out_dir} " \
37               "--index-dir {out_dir} " \
38               f"--previous-run {expected_result_dir} " \
39               "--relative-path"
40        self._macsyfinder_run(args)
41
42        for file_name in (self.all_systems_tsv,
43                          self.all_best_solutions,
44                          self.best_solution,
45                          self.summary,
46                          self.rejected_clusters):
47            with self.subTest(file_name=file_name):
48                expected_result = self.find_data(expected_result_dir, file_name)
49                get_results = os.path.join(self.out_dir, file_name)
50                self.assertFileEqual(expected_result, get_results, comment="#")
51
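
When the program's entry point is not easily importable, a functional test can also drive it exactly as a user would, through its command line, and compare the produced files with a reference run. A generic sketch using only the standard library (the my_pipeline command and the file names are made up; MacSyFinder's own tests rely on the helpers shown above instead):

import filecmp
import os
import subprocess
import tempfile
import unittest


class TestPipelineFunctional(unittest.TestCase):

    def test_end_to_end(self):
        with tempfile.TemporaryDirectory() as out_dir:
            # run the tool exactly as a user would (hypothetical command line)
            completed = subprocess.run(
                ["my_pipeline", "--input", "tests/data/small.fasta", "--out-dir", out_dir],
                capture_output=True,
                text=True,
            )
            self.assertEqual(completed.returncode, 0, completed.stderr)
            # compare the produced file with the expected ("reference") result
            self.assertTrue(filecmp.cmp(os.path.join(out_dir, "results.tsv"),
                                        "tests/data/expected_results.tsv",
                                        shallow=False))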

7.4.4. Coverage

To know whether our tests cover each condition in our code, there exist frameworks which run the tests and analyse which branches of the code are exercised or not. This operation is called test coverage. In Python, coverage is a very efficient framework for doing that.

coverage run --source macsypy tests/run_test.py -vv

coverage html

Below is the summary output, with the overall coverage (96%) and the coverage for each Python module:

coverage html report summary output

To see which parts of the code are not covered, just click on a module:

  • In yellow appear the partial branches: conditions that always evaluate to True or always to False during the tests (see the small sketch after the report below).

  • In red appears the code not covered by any test.

coverage html report detail output
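
For instance, a branch is reported as partial when the test data never make a condition take one of its two outcomes (when branch coverage is measured). A small sketch with hypothetical code and test:

def clean_sequence(sequence, upper=True):
    if upper:   # partial branch: the test below never exercises upper=False
        sequence = sequence.upper()
    return sequence.strip()


def test_clean_sequence():
    # only the True outcome of the 'if upper:' condition is covered
    assert clean_sequence(" atgc ") == "ATGC"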

A good test suite should cover more than 90% of the code.