Work+102012

Some of the intervals being added to the large table ran out of memory. It is hard to find these intervals because the text file is so large. I would like to make a Java class for handling large text files; the class could find lines, sort, etc.

I created the LargeTextFileHandler class and had it search for lines in a text file larger than 1 GB, so I can now see how long this takes the program.

Code used here (102112)

It takes the program about 1-5 min to go through the whole 1 GB file once.
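A minimal sketch of how such a class might search a file larger than memory by streaming it line by line (the class name matches the one above, but the method name and structure here are assumptions, not the actual code from the 102112 entry):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Sketch of a large-text-file handler: reads the file line by line through a
// BufferedReader, so memory use stays roughly constant regardless of file size.
public class LargeTextFileHandler {
    private final Path file;

    public LargeTextFileHandler(Path file) {
        this.file = file;
    }

    // Returns every line that contains the query substring.
    // For a 1 GB file this is a single sequential pass, which matches the
    // observed 1-5 min per full scan.
    public List<String> findLines(String query) throws IOException {
        List<String> hits = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(file)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.contains(query)) {
                    hits.add(line);
                }
            }
        }
        return hits;
    }
}
```

Other operations (sorting, counting) could follow the same streaming pattern, possibly with an external merge sort for data that does not fit in memory.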

I will re-add items 6011-8000 to the table.

I think I am starting to get a clearer idea of how I would like to analyze the data in my 1 GB table of results. For each protein matched by the 1st motif group (e.g. 0_3i_b_1_266i, 0_3i_b_1_716i, etc.), I would like to count how many times it matched another protein in the motif group matches, and then find what percentile of match counts this number falls into (e.g. is this count greater than 90% of the other proteins' counts?). I would also like to get the average and median e scores for all of that protein's matches. Once I have these percentile numbers, I would like to sort the data so that the proteins with the greatest number of matches and the greatest median or average e scores (proteins closest to the 100th-percentile-matches, 100th-percentile-e-score corner of a graph) are ranked toward the top. I would then look through the top-ranked proteins to see if any of them look interesting.
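The ranking described above could be sketched roughly as follows. This is an assumed implementation, not the actual analysis code: the class and record names are hypothetical, the percentile rank is the simple fraction of values at or below a given value, and "closeness to the corner" is taken as Euclidean distance to the point (100, 100) in (count percentile, score percentile) space, with higher e-score percentile treated as better per the description above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class MotifMatchRanker {

    // One match record: the matched protein and the e score of the match.
    public record Match(String protein, double eScore) {}

    // Percentage of values in 'all' that are <= v (simple percentile rank).
    static double percentileRank(double v, double[] all) {
        long atOrBelow = Arrays.stream(all).filter(x -> x <= v).count();
        return 100.0 * atOrBelow / all.length;
    }

    static double median(List<Double> xs) {
        List<Double> s = xs.stream().sorted().collect(Collectors.toList());
        int n = s.size();
        return n % 2 == 1 ? s.get(n / 2) : (s.get(n / 2 - 1) + s.get(n / 2)) / 2.0;
    }

    // Sorts proteins so those nearest the (100th percentile count,
    // 100th percentile median e score) corner come first.
    public static List<String> rank(List<Match> matches) {
        // Group e scores by protein.
        Map<String, List<Double>> scoresByProtein = new HashMap<>();
        for (Match m : matches) {
            scoresByProtein.computeIfAbsent(m.protein(), k -> new ArrayList<>())
                           .add(m.eScore());
        }
        double[] counts = scoresByProtein.values().stream()
                .mapToDouble(List::size).toArray();
        double[] medians = scoresByProtein.values().stream()
                .mapToDouble(MotifMatchRanker::median).toArray();

        // Distance of each protein to the (100, 100) corner; smaller is better.
        Map<String, Double> distance = new HashMap<>();
        for (Map.Entry<String, List<Double>> e : scoresByProtein.entrySet()) {
            double countPct = percentileRank(e.getValue().size(), counts);
            double scorePct = percentileRank(median(e.getValue()), medians);
            distance.put(e.getKey(), Math.hypot(100 - countPct, 100 - scorePct));
        }
        return distance.entrySet().stream()
                .sorted(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }
}
```

One caveat worth deciding up front: if these are BLAST-style E-values, smaller is better, so the score percentile would need to be inverted before measuring distance to the corner.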