BiGpairSEQ SIMULATOR

ABOUT

This program simulates BiGpairSEQ, a graph-theory-based adaptation of the pairSEQ algorithm for pairing T cell receptor sequences.

USAGE

Released as an executable .jar file with an interactive command-line UI. Requires Java 11 or higher (openjdk-17 recommended).

Run with the command:

java -jar BiGpairSEQ_Sim.jar

Processing sample plates with tens of thousands of sequences may require large amounts of RAM. It is often desirable to increase the JVM maximum heap allocation with the -Xmx flag. For example, to run the program with 32 gigabytes of memory, use the command:

java -Xmx32G -jar BiGpairSEQ_Sim.jar

Note that the heap size should not exceed the RAM physically present on the system.
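To confirm that the -Xmx setting took effect, a small stand-alone check can print the heap limit the JVM reports. This is a minimal sketch and not part of BiGpairSEQ_Sim; the class name HeapCheck and its methods are hypothetical.

```java
// Hypothetical helper, not part of BiGpairSEQ_Sim:
// reports the heap limit the running JVM was started with.
class HeapCheck {
    /** Maximum heap the JVM will attempt to use, in bytes (reflects -Xmx if set). */
    static long maxHeapBytes() {
        return Runtime.getRuntime().maxMemory();
    }

    /** Converts a byte count to gibibytes. */
    static double toGiB(long bytes) {
        return bytes / (1024.0 * 1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        System.out.printf("Max heap: %.2f GiB%n", toGiB(maxHeapBytes()));
    }
}
```

Saved as HeapCheck.java, it can be launched with the same flag, for example java -Xmx32G HeapCheck.java (single-file source launch is available from Java 11). The reported figure may be slightly below the requested value, depending on the garbage collector in use.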

OUTPUT

THEORY

Unlike pairSEQ, which calculates p-values for every TCR alpha/beta overlap and compares against a null distribution, BiGpairSEQ does not do any statistical calculations directly. Instead, BiGpairSEQ creates a simple bipartite weighted graph representing the sample plate. The distinct TCRA and TCRB sequences form the two sets of vertices. Every TCRA/TCRB pair that shares a well is connected by an edge, with the edge weight set to the number of wells in which both sequences appear. (Sequences present in all wells are filtered out prior to creating the graph, as there is no signal in their occupancy.) The problem of pairing TCRA/TCRB sequences thus reduces to the "assignment problem" of finding a maximum weight matching on a bipartite graph: the subset of vertex-disjoint edges whose weights sum to the maximum possible value.
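The graph construction described above can be sketched in a few lines of Java. This is an illustration only, under the assumption that each well is given as a set of TCRA identifiers and a set of TCRB identifiers; the class OccupancyGraph, the Well type, and the nested-map edge representation are hypothetical and are not taken from the simulator's source.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only; names are hypothetical, not the simulator's.
class OccupancyGraph {
    /** One well: the distinct alpha and beta sequences observed in it. */
    static final class Well {
        final Set<String> alphas;
        final Set<String> betas;

        Well(Set<String> alphas, Set<String> betas) {
            this.alphas = alphas;
            this.betas = betas;
        }
    }

    /** Returns edge weights as alpha -> (beta -> number of shared wells). */
    static Map<String, Map<String, Integer>> buildEdgeWeights(List<Well> plate) {
        // Count the wells each sequence occupies, so saturated sequences can be dropped.
        Map<String, Integer> alphaWells = new HashMap<>();
        Map<String, Integer> betaWells = new HashMap<>();
        for (Well w : plate) {
            for (String a : w.alphas) alphaWells.merge(a, 1, Integer::sum);
            for (String b : w.betas) betaWells.merge(b, 1, Integer::sum);
        }
        int wellCount = plate.size();
        Map<String, Map<String, Integer>> weights = new HashMap<>();
        for (Well w : plate) {
            for (String a : w.alphas) {
                if (alphaWells.get(a) == wellCount) continue; // present in every well: no signal
                for (String b : w.betas) {
                    if (betaWells.get(b) == wellCount) continue;
                    // Each shared well adds 1 to the weight of edge (a, b).
                    weights.computeIfAbsent(a, k -> new HashMap<>()).merge(b, 1, Integer::sum);
                }
            }
        }
        return weights;
    }

    /** A made-up three-well plate used only for demonstration. */
    static List<Well> samplePlate() {
        return List.of(
            new Well(Set.of("A1", "A2", "Ax"), Set.of("B1", "B2")),
            new Well(Set.of("A1", "Ax"), Set.of("B1")),
            new Well(Set.of("A2", "Ax"), Set.of("B2")));
    }

    public static void main(String[] args) {
        System.out.println(buildEdgeWeights(samplePlate()));
    }
}
```

On the made-up plate, Ax appears in all three wells and is dropped. The edges A1/B1 and A2/B2 each get weight 2, and A1/B2 and A2/B1 each get weight 1.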

This is a very well-studied combinatorial optimization problem, with many known solutions. The best currently known algorithm for bipartite graphs with integer weights is from [Duan and Su][2]. For a graph with m edges, n vertices per side, and maximum integer edge weight N, their algorithm runs in O(m sqrt(n) log(N)) time. With the best known running time and a requirement of integer weights, this algorithm is ideal for BiGpairSEQ, whose edge weights are well counts.
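To make the problem itself concrete, the following sketch finds a maximum weight matching by exhaustive search. It is neither the Duan and Su algorithm nor anything used by the simulator; it is a brute-force illustration, practical only for tiny graphs, of what "maximum weight matching" means. The class name and the weight-matrix representation (rows for TCRA vertices, columns for TCRB vertices, 0 for no edge) are assumptions made for the example.

```java
// Brute-force illustration only; exponential time, at most 31 columns (int bitmask).
class BruteForceMatching {
    /** Best total weight obtainable from rows row..end, given the columns already used. */
    static int best(int[][] w, int row, int usedCols) {
        if (row == w.length) return 0;
        // Option 1: leave this row's vertex unmatched.
        int result = best(w, row + 1, usedCols);
        // Option 2: match it to any free column it shares an edge with.
        for (int col = 0; col < w[row].length; col++) {
            boolean free = (usedCols & (1 << col)) == 0;
            if (free && w[row][col] > 0) {
                int candidate = w[row][col] + best(w, row + 1, usedCols | (1 << col));
                result = Math.max(result, candidate);
            }
        }
        return result;
    }

    /** Total weight of a maximum weight matching for the given weight matrix. */
    static int maxWeight(int[][] w) {
        return best(w, 0, 0);
    }

    public static void main(String[] args) {
        int[][] w = {
            {5, 4},
            {4, 0},
        };
        System.out.println(maxWeight(w)); // prints 8
    }
}
```

In the example matrix, taking the single heaviest edge (weight 5) blocks both other edges, while the two weight-4 edges are vertex-disjoint and sum to 8. A maximum weight matching therefore does not always contain the heaviest edge, which is why a simple greedy choice is not enough.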

Unfortunately, the qualities that make it ideal for BiGpairSEQ make it less generally useful, and it is not implemented by the graph theory library used in this simulator. This program therefore uses the Fibonacci heap-based algorithm of Fredman and Tarjan, which has a worst-case runtime of O(n (n log(n) + m)).
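For a rough sense of how the two bounds compare, the expressions inside the O(...) can be evaluated for a given graph size. This is a sketch under stated assumptions: big-O notation hides constant factors, so the numbers indicate growth rather than actual running time, and the figures used below (10,000 vertices per side, 1,000,000 edges, maximum edge weight 96) are illustrative values, not measurements from the simulator. The class and method names are hypothetical.

```java
// Evaluates the two asymptotic expressions; constants hidden by big-O are ignored.
class BoundComparison {
    static double log2(double x) {
        return Math.log(x) / Math.log(2);
    }

    /** m * sqrt(n) * log(N), the Duan and Su expression. */
    static double duanSu(long m, long n, long maxWeight) {
        return m * Math.sqrt(n) * log2(maxWeight);
    }

    /** n * (n log(n) + m), the Fredman and Tarjan expression. */
    static double fredmanTarjan(long m, long n) {
        return n * (n * log2(n) + m);
    }

    public static void main(String[] args) {
        long n = 10_000;    // assumed vertices per side
        long m = 1_000_000; // assumed number of edges
        long maxW = 96;     // assumed maximum edge weight
        System.out.printf("Duan-Su expression:        %.3e%n", duanSu(m, n, maxW));
        System.out.printf("Fredman-Tarjan expression: %.3e%n", fredmanTarjan(m, n));
    }
}
```

With these assumed values the Duan and Su expression is more than an order of magnitude smaller, mainly because the maximum edge weight is small relative to the number of vertices.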

TODO

  • Try invoking GC at end of workloads
  • Hold graph data in memory until another graph is read in
  • Enable GraphML output in addition to serialized object binaries, for data portability
    • Custom vertex type with attribute for sequence occupancy?
  • Re-implement CDR1 matching method
  • Re-implement command line arguments, to enable statistical simulation studies
  • Implement Duan and Su's maximum weight matching algorithm
    • Add controllable algorithm-type parameter?
  • Test whether pairing heap (currently used) or Fibonacci heap is more efficient for current matching algorithm
    • In theory the Fibonacci heap should be more efficient, but its implementation overhead may eliminate the theoretical advantage
    • Add controllable heap-type parameter?
  • Implement sample plates with random numbers of T cells per well
    • BiGpairSEQ is resilient to variations in well populations; pairSEQ is not

EXTERNAL LIBRARIES USED

CITATIONS

[1]: Howie, B., Sherwood, A. M., et al. "High-throughput pairing of T cell receptor alpha and beta sequences." Sci. Transl. Med. 7, 301ra131 (2015)

[2]: Duan, R. and Su, H. "A Scaling Algorithm for Maximum Weight Matching in Bipartite Graphs." (2012) https://web.eecs.umich.edu/~pettie/matching/Duan-Su-scaling-bipartite-matching.pdf

ACKNOWLEDGEMENTS

Conceived in collaboration with Dr. Alice MacQueen, who brought the original pairSEQ paper to the author's attention and explained all the biology terms the author didn't know.

AUTHOR

Eugene Fischer, 2021. UI improvements and documentation in 2022.