Simulating Biopsies

This guide describes how to simulate the effects of biopsies and DNA sequencing on the observed clone sizes.

This only works for the 2D algorithms (Moran2D and WF2D)

Counting mutant cells

Here is a simple simulation of neutral competition on a 10x10 grid Start with 100 single-cell clones, each with their own clone_id

import numpy as np
from clone_competition_simulation import Parameters, TimeParameters, PopulationParameters

np.random.seed(0)
p = Parameters(
    algorithm='Moran2D', 
    times=TimeParameters(max_time=10, division_rate=1),
    population=PopulationParameters(initial_grid=np.arange(100).reshape(10, 10), 
                                    cell_in_own_neighbourhood=False)
)
s = p.get_simulator()
s.run_sim()

# The final clone ids on the grid
print(s.grid_results[-1])
[[18, 18, 18,  2,  2,  2, 15, 15, 24, 24],
 [24, 24, 18,  2,  2,  2,  2, 15, 18, 24],
 [24, 24, 54, 63, 63, 63, 63, 63, 50, 24],
 [18, 15, 54, 54, 63, 63, 63, 63, 63, 18],
 [24, 24, 54, 54, 15, 63, 63, 24, 59, 24],
 [24, 24, 24, 24, 54, 63, 63, 63, 59, 59],
 [59, 15, 59, 59, 24, 15, 63, 59, 59, 59],
 [59, 15, 59, 15, 15, 59, 63, 59, 59, 18],
 [18, 15, 15, 71, 71, 71, 18, 18, 59, 18],
 [18, 18, 71, 18, 71,  2, 18, 15, 86, 24]]

To count the number of cells per clone in the whole grid, use the function biopsy_sample with biopsy=None

from clone_competition_simulation import biopsy_sample

print(biopsy_sample(s.grid_results[-1], s, biopsy=None, remove_initial_clones=False))
defaultdict(int,
            {2: 8,
             15: 13,
             18: 16,
             24: 18,
             50: 1,
             54: 6,
             59: 15,
             63: 17,
             71: 5,
             86: 1})

To take a "biopsy" from the grid, define the biopsy location and dimensions.
The biopsies must be rectangular.

from clone_competition_simulation import Biopsy

biopsy = Biopsy(
    origin=(3, 4),  # The grid coordinates of the first corner of the biopsy
    shape=(5, 3)  # The lengths of the sides of the biopsy
)

# These are the cells in the biopsy. 
# Equivalent to
# s.grid_results[-1][biopsy['biopsy_origin'][0]:biopsy['biopsy_origin'][0]+biopsy['biopsy_shape'][0], 
#                    biopsy['biopsy_origin'][1]:biopsy['biopsy_origin'][1]+biopsy['biopsy_shape'][1]]
print(s.grid_results[-1][3:3+5, 4:4+3])
[[63, 63, 63],
 [15, 63, 63],
 [54, 63, 63],
 [24, 15, 63],
 [15, 59, 63]])

Get a cell count for all of the clones in the biopsy.


print(biopsy_sample(s.grid_results[-1], s, biopsy=biopsy, remove_initial_clones=False))
defaultdict(int, {15: 3, 24: 1, 54: 1, 59: 1, 63: 9})

Where there are mutations, and potentially multiple mutations in each cell, the biopsy_sample function becomes more useful.

Run a simulation with some ongoing mutation

from clone_competition_simulation import Gene, MutationGenerator, NormalDist, FitnessParameters

mut_gen = MutationGenerator(
    genes=[Gene(name="Gene1", mutation_distribution=NormalDist(0.1), synonymous_proportion=0.5)],
)

np.random.seed(0)
p = Parameters(
    algorithm='Moran2D',
    times=TimeParameters(max_time=10, division_rate=1),
    population=PopulationParameters(initial_cells=100, 
                                    cell_in_own_neighbourhood=False), 
    fitness=FitnessParameters(
        mutation_rates=0.1, 
        mutation_generator=mut_gen
    )
)
s = p.get_simulator()
s.run_sim()

# The final clone ids on the grid
s.grid_results[-1]
[[ 99,   0,   0, 103, 100,   0,   0,  86, 109,   0],
 [  0,  89,  89,   0, 103,   0,   0,   0,   0,   0],
 [ 12,  12, 106, 106,   0,   0,   0,   0,   0,  56],
 [ 79,  12,   0,  62,   0,   0,   0,   0,  79,  79],
 [ 12,  12,  12,  56,   0,   0, 111,   0,   0,  97],
 [ 56, 101,  56,  56,  93,  26,   0,  79,   0, 101],
 [101,   0,  21,  21,  26,   0,   0,   0, 108,   0],
 [  0,  21,   0,  21,   0,   0,   0,   0, 104,   0],
 [  0,  36,  21,   0, 100,  98,   0,   0,   0,   0],
 [ 36,   0, 110,   0, 100,   0,   0, 112,   0,  99]]

Use the same biopsy as before. It extracts these cells from the grid:

print(s.grid_results[-1][3:3+5, 4:4+3])
[[  0,   0,   0],
 [  0,   0, 111],
 [ 93,  26,   0],
 [ 26,   0,   0],
 [  0,   0,   0]]

And returns these mutant cell counts:

print(biopsy_sample(s.grid_results[-1], s, biopsy, remove_initial_clones=False))
defaultdict(int, {0: 15, 26: 2, 21: 2, 93: 1, 30: 1, 29: 1, 111: 1})

Notice that now the cell counts per clone do not match the numbers in the grid.
This is because clones (and their mutations) are inherited by any subclones (which have a different clone_id).


Firstly, the clone with id=0 has 15 cells according to the function, but only 11 in the biopsy. This is because clone_id 0 marks the clones that existed at the start of the simulation. All subsequent mutations are descendants of that clone, meaning that all 15 cells in the biopsy belong to clone_id 0 (and its subclones).

We can exclude any initial clones from the counts (this is the default behaviour).
Now the count for clone_id=0 is gone.

print(biopsy_sample(s.grid_results[-1], s, biopsy))
defaultdict(int, {26: 2, 21: 2, 93: 1, 30: 1, 29: 1, 111: 1})

There is still the question of the clones with ids 21, 30 and 29.
According to the function, these clones are present in the grid, but those numbers are not there.

To see why, we can look at the descendants/subclones of those clones.

for clone_id in [21, 29, 30]:
    print(f"Descendants of clone {clone_id}: {s.get_clone_descendants(clone_id)}")
Descendants of clone 21: [21, 26, 37, 62, 60, 68, 81]
Descendants of clone 29: [29, 30, 38, 43, 44, 48, 53, 55, 63, 72, 93, 96]
Descendants of clone 30: [30, 38, 43, 44, 48, 53, 55, 63, 72, 93, 96]

Clone 26 is a descendant of clone 21.
Clone 93 is a descendant of both clone 30 and clone 29.

So the mutations that originally formed the clones 21, 29 and 30 were inherited by subsequent (sub)clones and survive to the end. The function is counting the number of cells in the biopsy that contain each mutation.

Simulating sequencing

Usually the DNA sequencing cannot accurately measure the number of cells containing each mutation.
Instead, it randomly samples DNA reads from a tissue sample, and the proportion of the sample containing the mutation is estimated based on the fraction of sampled reads that are mutant.


Running a larger simulation (using with WF2D because it is much faster).

Get the exact mutant cell counts from the end of the simulation.

from clone_competition_simulation import FixedValue

mut_gen = MutationGenerator(
    genes=[
        Gene(name='Gene1', mutation_distribution=FixedValue(1.2), synonymous_proportion=0.5), 
        Gene(name='Gene2', mutation_distribution=FixedValue(1.4), synonymous_proportion=0.5), 
        Gene(name='Gene3', mutation_distribution=FixedValue(0.7), synonymous_proportion=0.5)
    ], 
    combine_mutations='add'
)

np.random.seed(0)
p = Parameters(
    algorithm='WF2D',
    times=TimeParameters(max_time=200, division_rate=1),
    population=PopulationParameters(initial_cells=10000, 
                                    cell_in_own_neighbourhood=False), 
    fitness=FitnessParameters(
        mutation_rates=0.01, 
        mutation_generator=mut_gen
    )
)
s = p.get_simulator()
s.run_sim()

# This function tells us the exact cell counts for each mutation in the grid
exact_counts = biopsy_sample(s.grid_results[-1], s, biopsy=None)
print(exact_counts)
defaultdict(int,
            {8327: 437,
             7410: 1634,
             6889: 1634,
             6419: 1634,
             4540: 1634,
             3353: 1634,
             2773: 1944,
             1589: 1944,
             1061: 1944,
             8521: 396,
             7216: 478,
             4654: 551,
             4254: 551,
             3563: 928,
             2785: 966,
             ...})

This includes a large number of small clones, some with just a few cells

import matplotlib.pyplot as plt

plt.hist(exact_counts.values(), bins=np.arange(0, max(exact_counts.values())+1, 50));
plt.ylabel('Frequency')
plt.xlabel('Cell count')
plt.show()

png


To convert the mutant cell counts to variant allele frequency use the get_vafs_for_all_biopsies function.
This divides the cell number by the total cells in the grid.

from clone_competition_simulation import get_vafs_for_all_biopsies

vafs = get_vafs_for_all_biopsies(s, biopsies=None, heterozygous=False)
vafs
sample vaf gene clone_id ns
0 0 0.0437 Gene1 8327 True
1 0 0.1634 Gene2 7410 True
2 0 0.1634 Gene1 6889 True
3 0 0.1634 Gene1 6419 True
4 0 0.1634 Gene1 4540 True
... ... ... ... ... ...
1020 0 0.0004 Gene1 20000 False
1021 0 0.0003 Gene2 20001 True
1022 0 0.0001 Gene1 20003 False
1023 0 0.0001 Gene3 20004 True
1024 0 0.0001 Gene3 20005 True

1025 rows × 5 columns


Most clones in this case are a tiny fraction of the tissue size.
They would not be detected by most sequencing methods.

plt.hist(vafs['vaf'], 
         bins=np.linspace(0, 0.5, 50));
plt.xlabel('Variant allele fraction')
plt.ylabel('Frequency')
plt.show()

png


We can also simulate DNA sequencing using get_vafs_for_all_biopsies. To get vafs for the entire grid, run get_vafs_for_all_biopsies with biopsies=None

Define the sequencing coverage and a minimum number of "reads" required for a mutation to be detected.
Here we assume 100x coverage and a minimum detection limit of 5 reads.

sequenced_vafs = get_vafs_for_all_biopsies(s, biopsies=None, detection_limit=5, coverage=100)
print(sequenced_vafs)
sample vaf gene clone_id ns
0 0 0.11 Gene2 7410 True
1 0 0.07 Gene1 6889 True
2 0 0.06 Gene1 6419 True
3 0 0.11 Gene1 4540 True
4 0 0.06 Gene2 3353 True
5 0 0.12 Gene2 2773 True
6 0 0.09 Gene2 1589 True
7 0 0.07 Gene1 1061 True
8 0 0.07 Gene1 3563 True
9 0 0.23 Gene2 3047 True
10 0 0.17 Gene2 2827 True
11 0 0.16 Gene2 752 True
12 0 0.20 Gene2 547 True
13 0 0.08 Gene2 5992 True
14 0 0.08 Gene2 5666 True
15 0 0.05 Gene2 7859 True
16 0 0.05 Gene2 4158 True
17 0 0.07 Gene2 2410 True
18 0 0.09 Gene2 2408 True
19 0 0.06 Gene2 1284 True
20 0 0.06 Gene2 385 True
21 0 0.05 Gene1 7386 True
22 0 0.06 Gene2 6655 False
23 0 0.05 Gene2 11002 True
24 0 0.05 Gene2 11874 True
25 0 0.05 Gene1 8646 True
26 0 0.06 Gene1 7034 True
27 0 0.06 Gene2 1105 True
28 0 0.05 Gene2 12877 True

The results are the VAFs of the detected clones.
The clones are by default assumed to be heterozygous (see below).
These will be noisy, since there is random chance whether a "read" contains the mutation or not. The smallest VAFs are now around 5% (5 reads out of the coverage of 100).


We can also simulate sequencing of a biopsy taken from the grid


biopsy = Biopsy(origin=(30, 40),  shape=(50, 30))

print(get_vafs_for_all_biopsies(s, biopsies=[biopsy], detection_limit=5, coverage=100, heterozygous=False))
sample vaf gene clone_id ns
0 0 0.23 Gene1 8327 True
1 0 0.35 Gene2 7410 True
2 0 0.35 Gene1 6889 True
3 0 0.39 Gene1 6419 True
4 0 0.32 Gene1 4540 True
5 0 0.40 Gene2 3353 True
6 0 0.53 Gene2 2773 True
7 0 0.51 Gene2 1589 True
8 0 0.46 Gene1 1061 True
9 0 0.20 Gene2 8610 True
10 0 0.28 Gene2 5992 True
11 0 0.27 Gene2 5666 True
12 0 0.37 Gene2 3047 True
13 0 0.42 Gene2 2827 True
14 0 0.37 Gene2 752 True
15 0 0.35 Gene2 547 True
16 0 0.20 Gene1 10103 True
17 0 0.11 Gene2 11002 True
18 0 0.07 Gene2 10494 True
19 0 0.22 Gene1 9170 True
20 0 0.12 Gene1 7386 True
21 0 0.09 Gene2 6655 False
22 0 0.16 Gene2 11702 True
23 0 0.17 Gene2 8874 True
24 0 0.13 Gene2 5578 True
25 0 0.13 Gene2 11935 True
26 0 0.08 Gene2 9199 True
27 0 0.13 Gene2 7809 True
28 0 0.09 Gene2 3765 True
29 0 0.07 Gene1 7034 True
30 0 0.05 Gene2 1106 True
31 0 0.12 Gene2 12839 True
32 0 0.08 Gene2 13010 True
33 0 0.06 Gene2 13604 True
34 0 0.05 Gene2 10816 True
35 0 0.07 Gene2 13818 True
36 0 0.08 Gene2 14463 True
37 0 0.06 Gene2 15150 True
38 0 0.11 Gene1 15560 False
39 0 0.10 Gene1 15271 True
40 0 0.10 Gene1 14861 False
41 0 0.08 Gene1 16704 True

In many cases, mutations are heterozygous (on only one of two copies of the chromosome).
To simulate this, set heterozygous=True (this is the default).
The VAFs are now smaller (around half the size), and so usually fewer clones are detected.

print(get_vafs_for_all_biopsies(s, biopsies=[biopsy], detection_limit=5, coverage=100, heterozygous=True))
sample vaf gene clone_id ns
0 0 0.07 Gene1 8327 True
1 0 0.17 Gene2 7410 True
2 0 0.25 Gene1 6889 True
3 0 0.25 Gene1 6419 True
4 0 0.12 Gene1 4540 True
5 0 0.18 Gene2 3353 True
6 0 0.22 Gene2 2773 True
7 0 0.27 Gene2 1589 True
8 0 0.27 Gene1 1061 True
9 0 0.09 Gene2 5992 True
10 0 0.16 Gene2 5666 True
11 0 0.18 Gene2 3047 True
12 0 0.19 Gene2 2827 True
13 0 0.22 Gene2 752 True
14 0 0.17 Gene2 547 True
15 0 0.07 Gene1 10103 True
16 0 0.09 Gene2 11002 True
17 0 0.05 Gene2 10494 True
18 0 0.06 Gene1 9170 True
19 0 0.07 Gene1 7386 True
20 0 0.07 Gene2 6655 False
21 0 0.06 Gene2 11702 True
22 0 0.08 Gene2 8874 True
23 0 0.05 Gene2 5578 True
24 0 0.09 Gene2 11935 True
25 0 0.06 Gene2 9199 True
26 0 0.08 Gene2 7809 True
27 0 0.06 Gene2 3765 True
28 0 0.05 Gene2 12839 True
29 0 0.05 Gene2 13604 True
30 0 0.12 Gene2 13818 True
31 0 0.08 Gene2 14463 True
32 0 0.05 Gene1 15560 False
33 0 0.05 Gene1 14861 False

Multiple biopsies

A list of biopies can be given. The clones detected from these biopsies can be merged (like the oesophagus grid-sequencing experiments in Martincorena and Fowler et al 2018).

# Make a list of biopsies
biopsies = [
    Biopsy(origin=(0, 0), shape=(20, 20)),
    Biopsy(origin=(20, 0), shape=(20, 20)),
    Biopsy(origin=(0, 20), shape=(20, 20)),
    Biopsy(origin=(20, 20), shape=(20, 20))
]

# This returns a data frame with information about all of the clones detected
print(get_vafs_for_all_biopsies(s, biopsies, detection_limit=5, coverage=100))
sample vaf gene clone_id ns
0 0 0.22 Gene2 9160 True
1 0 0.26 Gene2 8391 True
2 0 0.25 Gene2 5757 True
3 0 0.22 Gene2 5036 True
4 0 0.22 Gene2 4297 True
... ... ... ... ... ...
89 3 0.54 Gene2 6655 False
90 3 0.27 Gene2 12706 True
91 3 0.06 Gene1 13973 True
92 3 0.19 Gene2 14164 True
93 3 0.05 Gene1 17189 False

94 rows × 5 columns


If merge_clones=True, the same clone seen in multiple samples will be treated as a single, larger clone.
The "vaf" in this case is just the sum of the VAFs from the individual biopsies (can be greater than 1).

If the biopsies are all the same size, then the VAF will be relative to the size of a single biopsy.
If the biopsies are different sizes, it is better to use merge_clones=False and do the merging yourself to get accurate sizes.


print(get_vafs_for_all_biopsies(s, biopsies, detection_limit=5, coverage=100, merge_clones=True))
vaf gene clone_id ns
0 0.40 Gene2 9160 True
1 0.49 Gene2 8391 True
2 0.34 Gene2 5757 True
3 0.39 Gene2 5036 True
4 0.52 Gene2 4297 True
5 0.43 Gene2 1033 True
6 0.39 Gene2 337 True
7 0.20 Gene2 11862 True
8 0.19 Gene2 4890 True
9 0.51 Gene2 2410 True
10 0.47 Gene2 2408 True
11 0.28 Gene2 1284 True
12 0.38 Gene2 385 True
13 0.05 Gene2 12752 True
14 0.05 Gene2 10960 False
15 0.05 Gene2 9115 True
16 0.05 Gene2 4158 True
17 0.06 Gene2 13078 True
18 0.18 Gene2 8959 True
19 0.10 Gene2 7859 True
20 0.09 Gene1 4921 True
21 0.12 Gene2 4704 True
22 0.13 Gene2 3043 True
23 0.08 Gene2 761 True
24 0.10 Gene1 553 True
25 0.11 Gene1 15101 True
26 0.11 Gene2 14698 True
27 0.05 Gene2 16232 True
28 0.06 Gene1 16445 True
29 0.06 Gene3 17356 False
30 0.19 Gene1 9284 True
31 0.91 Gene1 7386 True
32 0.97 Gene2 6655 False
33 0.98 Gene2 5992 True
34 0.84 Gene2 5666 True
35 0.80 Gene2 3047 True
36 0.92 Gene2 2827 True
37 1.02 Gene2 752 True
38 1.08 Gene2 547 True
39 0.60 Gene2 12706 True
40 0.84 Gene2 11002 True
41 0.76 Gene2 10494 True
42 0.88 Gene1 9170 True
43 0.11 Gene2 13285 True
44 0.11 Gene2 14739 True
45 0.06 Gene2 16373 True
46 0.06 Gene2 9940 True
47 0.05 Gene2 13453 True
48 0.14 Gene2 9435 True
49 0.12 Gene2 5002 True
50 0.06 Gene2 13056 True
51 0.13 Gene2 14164 True

Selecting time points

You don't have to use the final grid. Instead, you can give a sample_num for the index of the sample time you want.

print('Time point:', s.times[10])


Time point: 20.0

Early on, there are fewer mutant clones large enough to detect

print(get_vafs_for_all_biopsies(s, biopsies, detection_limit=5, coverage=100, sample_num=10))
sample vaf gene clone_id ns
0 1 0.10 Gene2 26 True
1 1 0.14 Gene2 337 True
2 1 0.13 Gene2 1033 True
3 2 0.12 Gene2 547 True
4 2 0.06 Gene1 640 True
5 2 0.05 Gene2 535 True
6 2 0.09 Gene2 752 True

Random coverage

Instead of a fixed coverage, the coverage can be randomly drawn from a negative binomial distribution. The coverage is drawn independently for all mutations.

print(get_vafs_for_all_biopsies(s, biopsies, detection_limit=5, binom_params=(100, 0.5)))
sample vaf gene clone_id ns
0 0 0.215909 Gene2 9160 True
1 0 0.280899 Gene2 8391 True
2 0 0.275862 Gene2 5757 True
3 0 0.158879 Gene2 5036 True
4 0 0.240964 Gene2 4297 True
... ... ... ... ... ...
96 3 0.516129 Gene1 7386 True
97 3 0.451613 Gene2 6655 False
98 3 0.358696 Gene2 12706 True
99 3 0.148936 Gene2 14164 True
100 3 0.073684 Gene1 15040 False

101 rows × 5 columns

dN/dS

You can calculate dN/dS for the observed mutations. This takes into account the expected proportion of synonymous and non-synonymous mutations for each gene

from clone_competition_simulation import get_sample_dnds

vafs = get_vafs_for_all_biopsies(s, biopsies=None)
print('Overall dN/dS', get_sample_dnds(vafs, s))
for i in range(1, 4):
    gene = f"Gene{i}"
    print(f'dN/dS for {gene}', get_sample_dnds(vafs, s, gene=gene))
Overall dN/dS 1.7553763440860215
dN/dS for Gene1 1.5819672131147542
dN/dS for Gene2 2.946564885496183
dN/dS for Gene3 0.6218487394957983

dN/dS can vary greatly depending on the sequencing parameters


vafs = get_vafs_for_all_biopsies(s, biopsies=biopsies, detection_limit=4, coverage=500)
print('Overall dN/dS', get_sample_dnds(vafs, s))
for i in range(1, 4):
    gene = f"Gene{i}"
    print(f'dN/dS for {gene}', get_sample_dnds(vafs, s, gene=gene))
Overall dN/dS 4.4324324324324325
dN/dS for Gene1 3.2222222222222223
dN/dS for Gene2 7.444444444444445
dN/dS for Gene3 0.1