Simulating Biopsies

This guide describes how to simulate the effects of biopsies and DNA sequencing on the observed clone sizes.

This only works for the 2D algorithms (Moran2D and WF2D)

Counting mutant cells

Here is a simple simulation of neutral competition on a 10x10 grid Start with 100 single-cell clones, each with their own clone_id

import numpy as np
from clone_competition_simulation import Parameters, TimeParameters, PopulationParameters

np.random.seed(0)
p = Parameters(
    algorithm='Moran2D', 
    times=TimeParameters(max_time=10, division_rate=1),
    population=PopulationParameters(initial_grid=np.arange(100).reshape(10, 10), 
                                    cell_in_own_neighbourhood=False)
)
s = p.get_simulator()
s.run_sim()

# The final clone ids on the grid
print(s.grid_results[-1])

[[18, 18, 18,  2,  2,  2, 15, 15, 24, 24],
 [24, 24, 18,  2,  2,  2,  2, 15, 18, 24],
 [24, 24, 54, 63, 63, 63, 63, 63, 50, 24],
 [18, 15, 54, 54, 63, 63, 63, 63, 63, 18],
 [24, 24, 54, 54, 15, 63, 63, 24, 59, 24],
 [24, 24, 24, 24, 54, 63, 63, 63, 59, 59],
 [59, 15, 59, 59, 24, 15, 63, 59, 59, 59],
 [59, 15, 59, 15, 15, 59, 63, 59, 59, 18],
 [18, 15, 15, 71, 71, 71, 18, 18, 59, 18],
 [18, 18, 71, 18, 71,  2, 18, 15, 86, 24]]

To count the number of cells per clone in the whole grid, use the function biopsy_sample with biopsy=None

from clone_competition_simulation import biopsy_sample

print(biopsy_sample(s.grid_results[-1], s, biopsy=None, remove_initial_clones=False))

defaultdict(int,
            {2: 8,
             15: 13,
             18: 16,
             24: 18,
             50: 1,
             54: 6,
             59: 15,
             63: 17,
             71: 5,
             86: 1})

To take a "biopsy" from the grid, define the biopsy location and dimensions.
The biopsies must be rectangular.

from clone_competition_simulation import Biopsy

biopsy = Biopsy(
    origin=(3, 4),  # The grid coordinates of the first corner of the biopsy
    shape=(5, 3)  # The lengths of the sides of the biopsy
)

# These are the cells in the biopsy. 
# Equivalent to
# s.grid_results[-1][biopsy['biopsy_origin'][0]:biopsy['biopsy_origin'][0]+biopsy['biopsy_shape'][0], 
#                    biopsy['biopsy_origin'][1]:biopsy['biopsy_origin'][1]+biopsy['biopsy_shape'][1]]
print(s.grid_results[-1][3:3+5, 4:4+3])

[[63, 63, 63],
 [15, 63, 63],
 [54, 63, 63],
 [24, 15, 63],
 [15, 59, 63]])

Get a cell count for all of the clones in the biopsy.


print(biopsy_sample(s.grid_results[-1], s, biopsy=biopsy, remove_initial_clones=False))

defaultdict(int, {15: 3, 24: 1, 54: 1, 59: 1, 63: 9})

Where there are mutations, and potentially multiple mutations in each cell, the biopsy_sample function becomes more useful.

Run a simulation with some ongoing mutation

from clone_competition_simulation import Gene, MutationGenerator, NormalDist, FitnessParameters

mut_gen = MutationGenerator(
    genes=[Gene(name="Gene1", mutation_distribution=NormalDist(0.1), synonymous_proportion=0.5)],
)

np.random.seed(0)
p = Parameters(
    algorithm='Moran2D',
    times=TimeParameters(max_time=10, division_rate=1),
    population=PopulationParameters(initial_cells=100, 
                                    cell_in_own_neighbourhood=False), 
    fitness=FitnessParameters(
        mutation_rates=0.1, 
        mutation_generator=mut_gen
    )
)
s = p.get_simulator()
s.run_sim()

# The final clone ids on the grid
s.grid_results[-1]

[[ 99,   0,   0, 103, 100,   0,   0,  86, 109,   0],
 [  0,  89,  89,   0, 103,   0,   0,   0,   0,   0],
 [ 12,  12, 106, 106,   0,   0,   0,   0,   0,  56],
 [ 79,  12,   0,  62,   0,   0,   0,   0,  79,  79],
 [ 12,  12,  12,  56,   0,   0, 111,   0,   0,  97],
 [ 56, 101,  56,  56,  93,  26,   0,  79,   0, 101],
 [101,   0,  21,  21,  26,   0,   0,   0, 108,   0],
 [  0,  21,   0,  21,   0,   0,   0,   0, 104,   0],
 [  0,  36,  21,   0, 100,  98,   0,   0,   0,   0],
 [ 36,   0, 110,   0, 100,   0,   0, 112,   0,  99]]

Use the same biopsy as before. It extracts these cells from the grid:

print(s.grid_results[-1][3:3+5, 4:4+3])

[[  0,   0,   0],
 [  0,   0, 111],
 [ 93,  26,   0],
 [ 26,   0,   0],
 [  0,   0,   0]]

And returns these mutant cell counts:

print(biopsy_sample(s.grid_results[-1], s, biopsy, remove_initial_clones=False))

defaultdict(int, {0: 15, 26: 2, 21: 2, 93: 1, 30: 1, 29: 1, 111: 1})

Notice that now the cell counts per clone do not match the numbers in the grid.
This is because clones (and their mutations) are inherited by any subclones (which have a different clone_id).

Firstly, the clone with id=0 has 15 cells according to the function, but only 11 in the biopsy. This is because clone_id 0 marks the clones that existed at the start of the simulation. All subsequent mutations are descendants of that clone, meaning that all 15 cells in the biopsy belong to clone_id 0 (and its subclones).

We can exclude any initial clones from the counts (this is the default behaviour).
Now the count for clone_id=0 is gone.

print(biopsy_sample(s.grid_results[-1], s, biopsy))

defaultdict(int, {26: 2, 21: 2, 93: 1, 30: 1, 29: 1, 111: 1})

There is still the question of the clones with ids 21, 30 and 29.
According to the function, these clones are present in the grid, but those numbers are not there.

To see why, we can look at the descendants/subclones of those clones.

for clone_id in [21, 29, 30]:
    print(f"Descendants of clone {clone_id}: {s.get_clone_descendants(clone_id)}")

Descendants of clone 21: [21, 26, 37, 62, 60, 68, 81]
Descendants of clone 29: [29, 30, 38, 43, 44, 48, 53, 55, 63, 72, 93, 96]
Descendants of clone 30: [30, 38, 43, 44, 48, 53, 55, 63, 72, 93, 96]

Clone 26 is a descendant of clone 21.
Clone 93 is a descendant of both clone 30 and clone 29.

So the mutations that originally formed the clones 21, 29 and 30 were inherited by subsequent (sub)clones and survive to the end. The function is counting the number of cells in the biopsy that contain each mutation.

Simulating sequencing

Usually the DNA sequencing cannot accurately measure the number of cells containing each mutation.
Instead, it randomly samples DNA reads from a tissue sample, and the proportion of the sample containing the mutation is estimated based on the fraction of sampled reads that are mutant.

Running a larger simulation (using with WF2D because it is much faster).

Get the exact mutant cell counts from the end of the simulation.

from clone_competition_simulation import FixedValue

mut_gen = MutationGenerator(
    genes=[
        Gene(name='Gene1', mutation_distribution=FixedValue(1.2), synonymous_proportion=0.5), 
        Gene(name='Gene2', mutation_distribution=FixedValue(1.4), synonymous_proportion=0.5), 
        Gene(name='Gene3', mutation_distribution=FixedValue(0.7), synonymous_proportion=0.5)
    ], 
    combine_mutations='add'
)

np.random.seed(0)
p = Parameters(
    algorithm='WF2D',
    times=TimeParameters(max_time=200, division_rate=1),
    population=PopulationParameters(initial_cells=10000, 
                                    cell_in_own_neighbourhood=False), 
    fitness=FitnessParameters(
        mutation_rates=0.01, 
        mutation_generator=mut_gen
    )
)
s = p.get_simulator()
s.run_sim()

# This function tells us the exact cell counts for each mutation in the grid
exact_counts = biopsy_sample(s.grid_results[-1], s, biopsy=None)
print(exact_counts)

defaultdict(int,
            {8327: 437,
             7410: 1634,
             6889: 1634,
             6419: 1634,
             4540: 1634,
             3353: 1634,
             2773: 1944,
             1589: 1944,
             1061: 1944,
             8521: 396,
             7216: 478,
             4654: 551,
             4254: 551,
             3563: 928,
             2785: 966,
             ...})

This includes a large number of small clones, some with just a few cells

import matplotlib.pyplot as plt

plt.hist(exact_counts.values(), bins=np.arange(0, max(exact_counts.values())+1, 50));
plt.ylabel('Frequency')
plt.xlabel('Cell count')
plt.show()

png

To convert the mutant cell counts to variant allele frequency use the get_vafs_for_all_biopsies function.
This divides the cell number by the total cells in the grid.

from clone_competition_simulation import get_vafs_for_all_biopsies

vafs = get_vafs_for_all_biopsies(s, biopsies=None, heterozygous=False)
vafs

	sample	vaf	gene	clone_id	ns
0	0	0.0437	Gene1	8327	True
1	0	0.1634	Gene2	7410	True
2	0	0.1634	Gene1	6889	True
3	0	0.1634	Gene1	6419	True
4	0	0.1634	Gene1	4540	True
...	...	...	...	...	...
1020	0	0.0004	Gene1	20000	False
1021	0	0.0003	Gene2	20001	True
1022	0	0.0001	Gene1	20003	False
1023	0	0.0001	Gene3	20004	True
1024	0	0.0001	Gene3	20005	True

1025 rows × 5 columns

Most clones in this case are a tiny fraction of the tissue size.
They would not be detected by most sequencing methods.

plt.hist(vafs['vaf'], 
         bins=np.linspace(0, 0.5, 50));
plt.xlabel('Variant allele fraction')
plt.ylabel('Frequency')
plt.show()

png

We can also simulate DNA sequencing using get_vafs_for_all_biopsies. To get vafs for the entire grid, run get_vafs_for_all_biopsies with biopsies=None

Define the sequencing coverage and a minimum number of "reads" required for a mutation to be detected.
Here we assume 100x coverage and a minimum detection limit of 5 reads.

sequenced_vafs = get_vafs_for_all_biopsies(s, biopsies=None, detection_limit=5, coverage=100)
print(sequenced_vafs)

	vaf	gene	clone_id	ns
0	0.11	Gene2	7410	True
1	0.07	Gene1	6889	True
2	0.06	Gene1	6419	True
3	0.11	Gene1	4540	True
4	0.06	Gene2	3353	True
5	0.12	Gene2	2773	True
6	0.09	Gene2	1589	True
7	0.07	Gene1	1061	True
8	0.07	Gene1	3563	True
9	0.23	Gene2	3047	True
10	0.17	Gene2	2827	True
11	0.16	Gene2	752	True
12	0.20	Gene2	547	True
13	0.08	Gene2	5992	True
14	0.08	Gene2	5666	True
15	0.05	Gene2	7859	True
16	0.05	Gene2	4158	True
17	0.07	Gene2	2410	True
18	0.09	Gene2	2408	True
19	0.06	Gene2	1284	True
20	0.06	Gene2	385	True
21	0.05	Gene1	7386	True
22	0.06	Gene2	6655	False
23	0.05	Gene2	11002	True
24	0.05	Gene2	11874	True
25	0.05	Gene1	8646	True
26	0.06	Gene1	7034	True
27	0.06	Gene2	1105	True
28	0.05	Gene2	12877	True

The results are the VAFs of the detected clones.
The clones are by default assumed to be heterozygous (see below).
These will be noisy, since there is random chance whether a "read" contains the mutation or not. The smallest VAFs are now around 5% (5 reads out of the coverage of 100).

We can also simulate sequencing of a biopsy taken from the grid


biopsy = Biopsy(origin=(30, 40),  shape=(50, 30))

print(get_vafs_for_all_biopsies(s, biopsies=[biopsy], detection_limit=5, coverage=100, heterozygous=False))

	vaf	gene	clone_id	ns
0	0.23	Gene1	8327	True
1	0.35	Gene2	7410	True
2	0.35	Gene1	6889	True
3	0.39	Gene1	6419	True
4	0.32	Gene1	4540	True
5	0.40	Gene2	3353	True
6	0.53	Gene2	2773	True
7	0.51	Gene2	1589	True
8	0.46	Gene1	1061	True
9	0.20	Gene2	8610	True
10	0.28	Gene2	5992	True
11	0.27	Gene2	5666	True
12	0.37	Gene2	3047	True
13	0.42	Gene2	2827	True
14	0.37	Gene2	752	True
15	0.35	Gene2	547	True
16	0.20	Gene1	10103	True
17	0.11	Gene2	11002	True
18	0.07	Gene2	10494	True
19	0.22	Gene1	9170	True
20	0.12	Gene1	7386	True
21	0.09	Gene2	6655	False
22	0.16	Gene2	11702	True
23	0.17	Gene2	8874	True
24	0.13	Gene2	5578	True
25	0.13	Gene2	11935	True
26	0.08	Gene2	9199	True
27	0.13	Gene2	7809	True
28	0.09	Gene2	3765	True
29	0.07	Gene1	7034	True
30	0.05	Gene2	1106	True
31	0.12	Gene2	12839	True
32	0.08	Gene2	13010	True
33	0.06	Gene2	13604	True
34	0.05	Gene2	10816	True
35	0.07	Gene2	13818	True
36	0.08	Gene2	14463	True
37	0.06	Gene2	15150	True
38	0.11	Gene1	15560	False
39	0.10	Gene1	15271	True
40	0.10	Gene1	14861	False
41	0.08	Gene1	16704	True

In many cases, mutations are heterozygous (on only one of two copies of the chromosome).
To simulate this, set heterozygous=True (this is the default).
The VAFs are now smaller (around half the size), and so usually fewer clones are detected.

print(get_vafs_for_all_biopsies(s, biopsies=[biopsy], detection_limit=5, coverage=100, heterozygous=True))

	vaf	gene	clone_id	ns
0	0.07	Gene1	8327	True
1	0.17	Gene2	7410	True
2	0.25	Gene1	6889	True
3	0.25	Gene1	6419	True
4	0.12	Gene1	4540	True
5	0.18	Gene2	3353	True
6	0.22	Gene2	2773	True
7	0.27	Gene2	1589	True
8	0.27	Gene1	1061	True
9	0.09	Gene2	5992	True
10	0.16	Gene2	5666	True
11	0.18	Gene2	3047	True
12	0.19	Gene2	2827	True
13	0.22	Gene2	752	True
14	0.17	Gene2	547	True
15	0.07	Gene1	10103	True
16	0.09	Gene2	11002	True
17	0.05	Gene2	10494	True
18	0.06	Gene1	9170	True
19	0.07	Gene1	7386	True
20	0.07	Gene2	6655	False
21	0.06	Gene2	11702	True
22	0.08	Gene2	8874	True
23	0.05	Gene2	5578	True
24	0.09	Gene2	11935	True
25	0.06	Gene2	9199	True
26	0.08	Gene2	7809	True
27	0.06	Gene2	3765	True
28	0.05	Gene2	12839	True
29	0.05	Gene2	13604	True
30	0.12	Gene2	13818	True
31	0.08	Gene2	14463	True
32	0.05	Gene1	15560	False
33	0.05	Gene1	14861	False

Multiple biopsies

A list of biopies can be given. The clones detected from these biopsies can be merged (like the oesophagus grid-sequencing experiments in Martincorena and Fowler et al 2018).

# Make a list of biopsies
biopsies = [
    Biopsy(origin=(0, 0), shape=(20, 20)),
    Biopsy(origin=(20, 0), shape=(20, 20)),
    Biopsy(origin=(0, 20), shape=(20, 20)),
    Biopsy(origin=(20, 20), shape=(20, 20))
]

# This returns a data frame with information about all of the clones detected
print(get_vafs_for_all_biopsies(s, biopsies, detection_limit=5, coverage=100))

	sample	vaf	gene	clone_id	ns
0	0	0.22	Gene2	9160	True
1	0	0.26	Gene2	8391	True
2	0	0.25	Gene2	5757	True
3	0	0.22	Gene2	5036	True
4	0	0.22	Gene2	4297	True
...	...	...	...	...	...
89	3	0.54	Gene2	6655	False
90	3	0.27	Gene2	12706	True
91	3	0.06	Gene1	13973	True
92	3	0.19	Gene2	14164	True
93	3	0.05	Gene1	17189	False

94 rows × 5 columns

If merge_clones=True, the same clone seen in multiple samples will be treated as a single, larger clone.
The "vaf" in this case is just the sum of the VAFs from the individual biopsies (can be greater than 1).

If the biopsies are all the same size, then the VAF will be relative to the size of a single biopsy.
If the biopsies are different sizes, it is better to use merge_clones=False and do the merging yourself to get accurate sizes.


print(get_vafs_for_all_biopsies(s, biopsies, detection_limit=5, coverage=100, merge_clones=True))

	vaf	gene	clone_id	ns
0	0.40	Gene2	9160	True
1	0.49	Gene2	8391	True
2	0.34	Gene2	5757	True
3	0.39	Gene2	5036	True
4	0.52	Gene2	4297	True
5	0.43	Gene2	1033	True
6	0.39	Gene2	337	True
7	0.20	Gene2	11862	True
8	0.19	Gene2	4890	True
9	0.51	Gene2	2410	True
10	0.47	Gene2	2408	True
11	0.28	Gene2	1284	True
12	0.38	Gene2	385	True
13	0.05	Gene2	12752	True
14	0.05	Gene2	10960	False
15	0.05	Gene2	9115	True
16	0.05	Gene2	4158	True
17	0.06	Gene2	13078	True
18	0.18	Gene2	8959	True
19	0.10	Gene2	7859	True
20	0.09	Gene1	4921	True
21	0.12	Gene2	4704	True
22	0.13	Gene2	3043	True
23	0.08	Gene2	761	True
24	0.10	Gene1	553	True
25	0.11	Gene1	15101	True
26	0.11	Gene2	14698	True
27	0.05	Gene2	16232	True
28	0.06	Gene1	16445	True
29	0.06	Gene3	17356	False
30	0.19	Gene1	9284	True
31	0.91	Gene1	7386	True
32	0.97	Gene2	6655	False
33	0.98	Gene2	5992	True
34	0.84	Gene2	5666	True
35	0.80	Gene2	3047	True
36	0.92	Gene2	2827	True
37	1.02	Gene2	752	True
38	1.08	Gene2	547	True
39	0.60	Gene2	12706	True
40	0.84	Gene2	11002	True
41	0.76	Gene2	10494	True
42	0.88	Gene1	9170	True
43	0.11	Gene2	13285	True
44	0.11	Gene2	14739	True
45	0.06	Gene2	16373	True
46	0.06	Gene2	9940	True
47	0.05	Gene2	13453	True
48	0.14	Gene2	9435	True
49	0.12	Gene2	5002	True
50	0.06	Gene2	13056	True
51	0.13	Gene2	14164	True

Selecting time points

You don't have to use the final grid. Instead, you can give a sample_num for the index of the sample time you want.

print('Time point:', s.times[10])

Time point: 20.0

Early on, there are fewer mutant clones large enough to detect

print(get_vafs_for_all_biopsies(s, biopsies, detection_limit=5, coverage=100, sample_num=10))

	sample	vaf	gene	clone_id	ns
0	1	0.10	Gene2	26	True
1	1	0.14	Gene2	337	True
2	1	0.13	Gene2	1033	True
3	2	0.12	Gene2	547	True
4	2	0.06	Gene1	640	True
5	2	0.05	Gene2	535	True
6	2	0.09	Gene2	752	True

Random coverage

Instead of a fixed coverage, the coverage can be randomly drawn from a negative binomial distribution. The coverage is drawn independently for all mutations.

print(get_vafs_for_all_biopsies(s, biopsies, detection_limit=5, binom_params=(100, 0.5)))

	sample	vaf	gene	clone_id	ns
0	0	0.215909	Gene2	9160	True
1	0	0.280899	Gene2	8391	True
2	0	0.275862	Gene2	5757	True
3	0	0.158879	Gene2	5036	True
4	0	0.240964	Gene2	4297	True
...	...	...	...	...	...
96	3	0.516129	Gene1	7386	True
97	3	0.451613	Gene2	6655	False
98	3	0.358696	Gene2	12706	True
99	3	0.148936	Gene2	14164	True
100	3	0.073684	Gene1	15040	False

101 rows × 5 columns

dN/dS

You can calculate dN/dS for the observed mutations. This takes into account the expected proportion of synonymous and non-synonymous mutations for each gene

from clone_competition_simulation import get_sample_dnds

vafs = get_vafs_for_all_biopsies(s, biopsies=None)
print('Overall dN/dS', get_sample_dnds(vafs, s))
for i in range(1, 4):
    gene = f"Gene{i}"
    print(f'dN/dS for {gene}', get_sample_dnds(vafs, s, gene=gene))

Overall dN/dS 1.7553763440860215
dN/dS for Gene1 1.5819672131147542
dN/dS for Gene2 2.946564885496183
dN/dS for Gene3 0.6218487394957983

dN/dS can vary greatly depending on the sequencing parameters


vafs = get_vafs_for_all_biopsies(s, biopsies=biopsies, detection_limit=4, coverage=500)
print('Overall dN/dS', get_sample_dnds(vafs, s))
for i in range(1, 4):
    gene = f"Gene{i}"
    print(f'dN/dS for {gene}', get_sample_dnds(vafs, s, gene=gene))

Overall dN/dS 4.4324324324324325
dN/dS for Gene1 3.2222222222222223
dN/dS for Gene2 7.444444444444445
dN/dS for Gene3 0.1