Add performance section to readme

2022-02-20 23:31:25 -06:00
parent 601e141fd0
commit 94b54b3416

@@ -20,12 +20,12 @@ The problem of pairing TCRA/TCRB sequences thus reduces to the "assignment probl
matching on a bipartite graph--the subset of vertex-disjoint edges whose weights sum to the maximum possible value.
This is a well-studied combinatorial optimization problem, with many known solutions.
The best currently-known algorithm for bipartite graphs with integer weights--which is what BiGpairSEQ uses--is
from Duan and Su (2012). For a graph with m edges, n vertices per side, and maximum integer edge weight N,
their algorithm runs in **O(m sqrt(n) log(N))** time. This is the best known efficiency for finding a maximum weight
matching on a bipartite graph, and the integer edge weight requirement makes it ideal for BiGpairSEQ.
The most efficient known algorithm for maximum weight matching is from Duan and Su (2012), and requires a bipartite graph
with strictly integer edge weights. For a graph with m edges, n vertices per side, and maximum integer edge weight N,
their algorithm runs in **O(m sqrt(n) log(N))** time. As the graph representation of a pairSEQ experiment is
bipartite with integer weights, this algorithm is ideal for BiGpairSEQ.
Unfortunately, it's a fairly new algorithm. It is not implemented by the graph theory library used in this simulator.
Unfortunately, it's a fairly new algorithm, and not yet implemented by the graph theory library used in this simulator.
So this program instead uses the Fibonacci heap-based algorithm of Fredman and Tarjan (1987), which has a worst-case
runtime of **O(n (n log(n) + m))**. The algorithm is implemented as described in Mehlhorn and Näher (1999).
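To make the objective concrete, here is a minimal sketch of what "maximum weight matching" means on a tiny bipartite graph. It is a brute-force illustration only, not the Duan and Su or Fredman and Tarjan algorithm, and not the simulator's code. The function name `max_weight_matching` and the weight matrix are hypothetical; each weight stands for an integer edge weight between an alpha (row) and a beta (column).

```python
from itertools import permutations


def max_weight_matching(weights):
    """Brute-force maximum weight matching on a small square bipartite graph.

    weights[i][j] is the integer weight of the edge between alpha i and
    beta j; 0 means there is no edge. Runs in factorial time, so it is
    only suitable for illustrating the objective on toy inputs.
    """
    n = len(weights)
    best_total, best_pairs = -1, []
    for perm in permutations(range(n)):
        # Keep only pairs that correspond to real edges (non-zero weight).
        pairs = [(i, j) for i, j in enumerate(perm) if weights[i][j] > 0]
        total = sum(weights[i][j] for i, j in pairs)
        if total > best_total:
            best_total, best_pairs = total, pairs
    return best_total, best_pairs


# Hypothetical weights: rows are alphas, columns are betas.
weights = [
    [5, 4, 0],
    [4, 1, 0],
    [0, 0, 3],
]
total, pairs = max_weight_matching(weights)
print(total, pairs)
```

Note that greedily taking the heaviest edge first (weight 5) forces the second alpha onto the weight-1 edge, for a total of 9, while the optimal matching pairs the two weight-4 edges for a total of 11. This is why a true matching algorithm is needed rather than a per-sequence best-match heuristic.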
@@ -218,6 +218,22 @@ Example output:
P-values are calculated *after* BiGpairSEQ matching is completed, for purposes of comparison,
using the (2021 corrected) formula from the original pairSEQ paper (Howie, et al. 2015).
### PERFORMANCE
Performance details of the example excerpted above:
On a home computer with a Ryzen 5600X CPU, 64GB of 3200MHz DDR4 RAM, and a PCIe 3.0 SSD, running Linux Mint 20.3 Edge (5.13 kernel),
the author ran a BiGpairSEQ simulation of a 96-well sample plate with 30,000 T cells/well comprising ~11,800 alphas and betas,
taken from a sample of 4,000,000 distinct cells with an exponential frequency distribution.
With min/max occupancy thresholds of 3 and 94 wells for matching, and no other pre-filtering, BiGpairSEQ identified 5,151
correct pairings and 18 incorrect pairings, for an accuracy of 99.652%.
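The accuracy figure follows directly from the two pairing counts. As a quick check, assuming accuracy is defined as correct pairings divided by all pairings made (the variable names here are illustrative):

```python
# Pairing counts reported for the example run.
correct_pairings = 5151
incorrect_pairings = 18

# Accuracy as the fraction of all identified pairings that were correct.
total_pairings = correct_pairings + incorrect_pairings
accuracy_percent = 100 * correct_pairings / total_pairings

print(f"{total_pairings} pairings, accuracy {accuracy_percent:.3f}%")
```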
The simulation time was 14'22". If intermediate results were held in memory, this would be equivalent to the total elapsed time.
Since this implementation of BiGpairSEQ writes intermediate results to improve the efficiency of *repeated* simulations,
the actual elapsed time was greater. File I/O time was not measured directly, but is estimated to have taken slightly less time than the simulation itself.
Real elapsed time from start to finish was under 30 minutes.
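The timing figures above imply an upper bound on the unmeasured file I/O time. A small sketch of that arithmetic, assuming the elapsed time consists only of simulation plus file I/O (the helper `to_seconds` is hypothetical):

```python
def to_seconds(minutes, seconds=0):
    """Convert a minutes/seconds duration to whole seconds."""
    return minutes * 60 + seconds


simulation_s = to_seconds(14, 22)   # reported simulation time, 14'22"
total_bound_s = to_seconds(30)      # real elapsed time was under 30 minutes

# Whatever was not simulation is bounded above by the remainder.
io_bound_s = total_bound_s - simulation_s

print(f"simulation: {simulation_s} s, file I/O: under {io_bound_s} s")
```

The bound is loose, since the total was only reported as "under 30 minutes", so it does not by itself establish how I/O time compares to simulation time.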
## TODO
* ~~Try invoking GC at end of workloads to reduce paging to disk~~ DONE