update readme
2. [SIMULATING EXPERIMENTS FROM THE 2015 pairSEQ PAPER](#simulating-experiments-from-the-2015-pairseq-paper)
   1. [EXPERIMENT 1](#experiment-1)
   2. [EXPERIMENT 3](#experiment-3)
6. [CITATIONS](#citations)
7. [EXTERNAL LIBRARIES USED](#external-libraries-used)
8. [ACKNOWLEDGEMENTS](#acknowledgements)
9. [AUTHOR](#author)
10. [DISCLOSURE](#disclosure)
11. [TODO](#todo)

## ABOUT
…pairs called in the pairSEQ experiment. These results show that at very high sam…

…underlying frequency distribution drastically affect the results. The real distribution clearly has a much longer "tail"
than the simulated exponential distribution. Implementing a way to exert finer control over the sampling distribution from
the file of distinct cells may enable better simulated replication of this experiment.
## CITATIONS

* Howie, B., Sherwood, A. M., et al. ["High-throughput pairing of T cell receptor alpha and beta sequences."](https://pubmed.ncbi.nlm.nih.gov/26290413/) Sci. Transl. Med. 7, 301ra131 (2015)
…BiGpairSEQ was conceived in collaboration with the author's spouse, Dr. Alice Ma…
…pairSEQ paper to the author's attention and explained all the biology terms he didn't know.

## AUTHOR
BiGpairSEQ algorithm and simulation by Eugene Fischer, 2021. Improvements and documentation, 2022–2025.

## DISCLOSURE
The earliest versions of the BiGpairSEQ simulator were written in 2021 to let Dr. MacQueen test hypothetical extensions
of the published pairSEQ protocol while she was interviewing for a position at Adaptive Biotechnologies. She was
employed at Adaptive Biotechnologies starting in 2022.

The author has worked on this BiGpairSEQ simulator since 2021 without Dr. MacQueen's involvement, since she has had
access to related, proprietary technologies. The author has had no such access, relying exclusively on the 2015 pairSEQ
paper and other academic publications. He continues to work on the BiGpairSEQ simulator recreationally, as it has been
a means of exploring some very beautiful math.

## TODO

* ~~Try invoking GC at end of workloads to reduce paging to disk~~ DONE
* ~~Hold graph data in memory until another graph is read in? ABANDONED UNABANDONED~~ DONE
  * ~~*No, this won't work, because BiGpairSEQ simulations alter the underlying graph based on filtering constraints. Changes would cascade with multiple experiments.*~~
  * Might have figured out a way to do it, by taking edges out and then putting them back into the graph. This may actually be possible.
  * It is possible, though the modifications to the graph incur their own performance penalties. Needs testing to see which option is best. It may be computer-specific.
* ~~Test whether pairing heap (currently used) or Fibonacci heap is more efficient for priority queue in current matching algorithm~~ DONE
  * ~~In theory a Fibonacci heap should be more efficient, but its complexity overhead may eliminate the theoretical advantage~~
  * ~~Add controllable heap-type parameter?~~
    * Parameter implemented. Pairing heap is the current default.
* ~~Implement sample plates with random numbers of T cells per well.~~ DONE
  * Possible BiGpairSEQ advantage over pairSEQ: BiGpairSEQ is resilient to variations in well population sizes on a sample plate; pairSEQ is not, due to the nature of its probability calculations.
  * Preliminary data suggests that BiGpairSEQ behaves roughly as though the whole plate had whatever the *average* well concentration is, but that's still speculative.
* ~~See if there's a reasonable way to reformat Sample Plate files so that wells are columns instead of rows.~~
  * ~~Problem is the variable number of cells in a well~~
  * ~~Apache Commons CSV library writes entries a row at a time~~
  * Got this working, but at the cost of a profoundly strange bug in graph occupancy filtering. Have reverted the repo until I can figure out what caused that. Given how easily the tidyverse transposes CSV matrices in R, it might not even be worth fixing.
* ~~Enable GraphML output in addition to serialized object binaries, for data portability~~ DONE
  * ~~Have a branch where this is implemented, but there's a bug that broke matching. Don't currently have time to fix.~~
* ~~Re-implement command line arguments, to enable scripting and statistical simulation studies~~ DONE
* ~~Implement custom Vertex class to simplify code and make it easier to implement different MWM algorithms~~ DONE
  * Advantage: would eliminate the need to use maps to associate vertices with sequences, which would make the code easier to understand.
  * This also seems to be faster than the version with lots of maps when using the same algorithm, which is a nice bonus!
* ~~Implement simulation of read depth, and of read errors. Pre-filter graph for difference in read count to eliminate spurious sequences.~~ DONE
  * Pre-filtering based on comparing (read depth) * (occupancy) to (read count) for each sequence works extremely well
* ~~Add read depth simulation options to CLI~~ DONE
* ~~Update GraphML output to reflect current Vertex class attributes~~ DONE
  * Individual well data from the SequenceRecords could be included, if there's ever a reason for it
* ~~Implement simulation of sequences being misread as other real sequences~~ DONE
* Implement redistributive heap for LEDA matching algorithm to achieve theoretical worst case of O(n(m + n log C)), where C is the highest edge weight.
* Update matching metadata output options in CLI
* Add frequency distribution details to metadata output
  * Need to make an enum for the different distribution types, refactor the Plate class and user interfaces, add the necessary fields to GraphWithMapData, and then call it from Simulator
* Update performance data in this readme
* ~~Add section to ReadMe describing data filtering methods.~~ DONE, now part of algorithm description
* Re-implement CDR1 matching method
* ~~Refactor simulator code to collect all needed data in a single scan of the plate~~ DONE
  * ~~Currently it scans once for the vertices and then again for the edge weights. This made simulating read depth awkward, and incompatible with caching of plate files.~~
  * ~~This would be a fairly major rewrite of the simulator code, but could make things faster, and would definitely make them cleaner.~~
* Implement Duan and Su's maximum weight matching algorithm
  * ~~Add controllable algorithm-type parameter?~~ DONE
  * This would be fun and valuable, but would probably take more time than I have for a hobby project.
* ~~Implement an auction algorithm for maximum weight matching~~ DONE
* Implement a forward/reverse auction algorithm for maximum weight matching
* Implement an algorithm for approximating a maximum weight matching
  * Some of these run in linear or near-linear time
  * Given that the underlying biological samples have many, many sources of error, this would probably be the most useful option in practice. It seems less mathematically elegant, though, and so less fun for me.
* Implement Vose's alias method for arbitrary statistical distributions of cells
  * Should probably refactor to use Apache Commons RNG for this
* Use Commons JCS for caching
* Parameterize pre-filtering options
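One TODO item above notes that pre-filtering by comparing (read depth) * (occupancy) to (read count) works extremely well. A minimal sketch of that check, assuming illustrative names throughout (`SeqCount`, `prefilter`, and the tolerance parameter are not BiGpairSEQ's actual API):

```java
import java.util.List;

// Sketch of the pre-filtering idea: a sequence whose total read count falls
// well below (read depth) * (well occupancy) is likely spurious and is
// dropped before graph construction. Names are illustrative, not the
// repository's actual API.
public class ReadCountPrefilter {

    /** occupancy = number of wells the sequence appears in. */
    record SeqCount(String sequence, int occupancy, long readCount) {}

    /** Keep a sequence only if its reads reach tolerance * (readDepth * occupancy). */
    static List<SeqCount> prefilter(List<SeqCount> seqs, long readDepth, double tolerance) {
        return seqs.stream()
                .filter(s -> s.readCount() >= tolerance * readDepth * s.occupancy())
                .toList();
    }
}
```

With a read depth of 100 and an occupancy of 10, the expected read count is 1000, so a tolerance of 0.5 drops anything under 500 reads.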
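The approximate-matching TODO item mentions near-linear-time options. The simplest is the textbook greedy 1/2-approximation: scan edges in order of decreasing weight and keep any edge whose endpoints are both still free. A sketch with illustrative names (this is the generic algorithm, not code from this repository):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Greedy 1/2-approximation for maximum weight matching. O(m log m) from the
// sort; the resulting matching's weight is at least half the optimum.
public class GreedyMatching {

    record Edge(int u, int v, double weight) {}

    static List<Edge> approxMaxWeightMatching(List<Edge> edges) {
        List<Edge> sorted = new ArrayList<>(edges);
        sorted.sort(Comparator.comparingDouble(Edge::weight).reversed());
        Set<Integer> matched = new HashSet<>();
        List<Edge> matching = new ArrayList<>();
        for (Edge e : sorted) {
            if (!matched.contains(e.u()) && !matched.contains(e.v())) {
                matching.add(e);      // both endpoints free: take the edge
                matched.add(e.u());
                matched.add(e.v());
            }
        }
        return matching;
    }
}
```

On the weighted path 1–2 (weight 3), 2–3 (weight 4), 3–4 (weight 3), greedy takes only edge 2–3 for weight 4, while the optimum is 6; that 2/3 ratio illustrates the worst-case 1/2 guarantee without reaching it.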
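Vose's alias method, proposed in the TODO list for arbitrary cell frequency distributions, gives O(1) draws from any discrete distribution after O(n) setup. A self-contained sketch, assuming weights are plain relative frequencies (the class and method names are illustrative, not the repository's Plate API):

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.Random;

// Vose's alias method: preprocess weights into a table of (probability, alias)
// pairs so that each sample needs only one uniform int and one uniform double.
public class AliasSampler {
    private final double[] prob;
    private final int[] alias;
    private final Random rng;

    AliasSampler(double[] weights, Random rng) {
        int n = weights.length;
        this.rng = rng;
        prob = new double[n];
        alias = new int[n];
        double total = Arrays.stream(weights).sum();
        double[] scaled = new double[n];
        Deque<Integer> small = new ArrayDeque<>(), large = new ArrayDeque<>();
        for (int i = 0; i < n; i++) {
            scaled[i] = weights[i] * n / total;     // mean-1 scaling
            (scaled[i] < 1.0 ? small : large).push(i);
        }
        while (!small.isEmpty() && !large.isEmpty()) {
            int s = small.pop(), l = large.pop();
            prob[s] = scaled[s];
            alias[s] = l;                           // l donates s's deficit
            scaled[l] = (scaled[l] + scaled[s]) - 1.0;
            (scaled[l] < 1.0 ? small : large).push(l);
        }
        while (!large.isEmpty()) prob[large.pop()] = 1.0;
        while (!small.isEmpty()) prob[small.pop()] = 1.0; // numerical leftovers
    }

    int sample() {
        int column = rng.nextInt(prob.length);
        return rng.nextDouble() < prob[column] ? column : alias[column];
    }
}
```

Apache Commons RNG ships a production version of this (its sampling module includes an alias-method discrete sampler), which is presumably what the "refactor to use apache commons rng" sub-item has in mind.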
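The metadata TODO item calls for an enum of distribution types. A minimal sketch of what that refactor might start from; the variant names are guesses based on the distributions this readme mentions (the exponential default, plus the arbitrary empirical case Vose's alias method would enable), not names from the codebase:

```java
// Hypothetical distribution-type enum for the Plate refactor; variant names
// are illustrative guesses, not identifiers from the BiGpairSEQ source.
public enum CellDistribution {
    UNIFORM,
    EXPONENTIAL,
    EMPIRICAL_ALIAS   // arbitrary distribution sampled via Vose's alias method
}
```

An enum like this could be stored on GraphWithMapData and reported by Simulator, as the TODO item sketches.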