Refactor Entropy Code 5-29-13

I would like to refactor my entropy code so that it is a little cleaner and can handle a wide variety of situations. I want to be able to prepare the list of numbers from multiple sources. Then I want to take the lists and be able to perform various types of calculations.

I want to be able to prepare the data from
-a gpr file
-a tab delimited text file (raw data)
-a tab delimited text file (normalized data; I could use median and Combat normalized data)

From this data I want to produce a file containing
-a raw number list
-a normalized number list
-a normalized number list converted to integers
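A minimal sketch of the data preparation step, assuming the input is available as tab delimited lines with some header lines before the column-name row (which is how I expect both the gpr and the text files to look). The class name TabColumnReader, the method names, and the column name "F635 Median" in the example are placeholders for illustration; these notes do not say which column the real code reads. Converting to integers by rounding is also an assumption (truncating would be the other option).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sketch: pull one numeric column out of tab delimited lines (hypothetical helper). */
final class TabColumnReader {

    /**
     * Treats the first line that contains a cell equal to columnName as the
     * column-name row and parses that column from every later line.
     * Lines before the column-name row (e.g. gpr header lines) are skipped.
     */
    static List<Double> readColumn(List<String> lines, String columnName) {
        int col = -1;
        List<Double> values = new ArrayList<>();
        for (String line : lines) {
            String[] cells = line.split("\t", -1);
            if (col < 0) {
                // Still looking for the column-name row.
                for (int i = 0; i < cells.length; i++) {
                    if (unquote(cells[i]).equals(columnName)) {
                        col = i;
                        break;
                    }
                }
                continue;
            }
            // Skip short or blank rows; a non-numeric cell will throw NumberFormatException.
            if (col >= cells.length || cells[col].trim().isEmpty()) {
                continue;
            }
            values.add(Double.parseDouble(cells[col].trim()));
        }
        if (col < 0) {
            throw new IllegalArgumentException("Column not found: " + columnName);
        }
        return values;
    }

    /** Converts a (normalized) list to integers by rounding; rounding is an assumption. */
    static int[] toIntegers(List<Double> values) {
        int[] out = new int[values.size()];
        for (int i = 0; i < out.length; i++) {
            out[i] = (int) Math.round(values.get(i));
        }
        return out;
    }

    private static String unquote(String s) {
        return s.trim().replace("\"", "");
    }

    public static void main(String[] args) {
        // Made-up lines for illustration only, not taken from a real gpr.
        List<String> lines = Arrays.asList(
                "some header line",
                "\"Block\"\t\"Name\"\t\"F635 Median\"",
                "1\tpep1\t65535",
                "1\tpep2\t860.6");
        List<Double> raw = readColumn(lines, "F635 Median");
        System.out.println(raw);
        System.out.println(Arrays.toString(toIntegers(raw)));
    }
}
```

Working on a list of lines rather than a file path keeps the parsing testable with the mini files; reading the file into lines can be done separately.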

I want to take these number lists and calculate
-entropy (raw)
-entropy (normalized, from Combat and median normalized data)
-CV (raw)
-CV (normalized, from Combat and median normalized data)
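For reference, the two quantities written out under the standard definitions. These notes do not state the log base, how values are binned, or whether the population or sample standard deviation is used, so base 2, one bin per distinct value, and the population form are all assumptions here.

```latex
% N values fall into k bins; bin i holds n_i values.
H = -\sum_{i=1}^{k} p_i \log_2 p_i, \qquad p_i = \frac{n_i}{N}

% Coefficient of variation: standard deviation over mean.
\mathrm{CV} = \frac{\sigma}{\mu}, \qquad
\mu = \frac{1}{N}\sum_{j=1}^{N} x_j, \qquad
\sigma = \sqrt{\frac{1}{N}\sum_{j=1}^{N} (x_j - \mu)^2}
```

With these definitions a list where every value is identical has H = 0 and CV = 0, and a list of N values with no duplicates has H = log2(N).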

I want to test my code to make sure that it is working with a few small lists of numbers:
-one kind of random
-one extreme one populated with all of the highest numbers
-one extreme one populated with all of the lowest numbers
-one with no duplicates so that there would be no "bin" with more than one item

mini gpr file here: "F:\kurt\storage\CIM Research Folder\DR\2013\5-29-13\entropy\Mini_gpr.gpr"

mini tab delimited text file raw data here: "F:\kurt\storage\CIM Research Folder\DR\2013\5-29-13\entropy\tab delimited raw.xlsx"

mini tab delimited text file normalized data here: "F:\kurt\storage\CIM Research Folder\DR\2013\5-29-13\entropy\tab delimited normalized.xlsx"

I also copied these files to the shared drive so that programs on other computers can access them: "S:\Research\Cancer_Eradication\Discovering tumor specific antigens\entropy\5-29-13\entropy"

one kind of random dataset
random: 65535, 861, 65535, 861, 65535, 556, 65535, 956, 255, 1, 1, 1, 255

one random dataset with no highest or lowest values
random_nhl: 235, 861, 235, 861, 235, 556, 80, 956, 255, 42000, 42000, 42000, 255

one extreme with highest numbers
all_high: 65535, 65535, 65535, 65535, 65535, 65535, 65535, 65535, 65535, 65535, 65535, 65535, 65535

one extreme with lowest numbers
all_low: 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1

one with no duplicates
all_different: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 65535
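A rough sketch of how these five lists could be run through an entropy and CV calculation. This is not the refactored code itself: the class EntropyTestCases and its methods are hypothetical, and it assumes base-2 Shannon entropy with one bin per distinct value and a CV computed from the population standard deviation.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch: entropy and CV for the small test lists (hypothetical helper). */
final class EntropyTestCases {

    /** Shannon entropy in bits, one bin per distinct value (assumption). */
    static double entropy(int[] values) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (int v : values) {
            counts.merge(v, 1, Integer::sum);
        }
        double h = 0.0;
        for (int count : counts.values()) {
            double p = (double) count / values.length;
            h -= p * Math.log(p) / Math.log(2);
        }
        return h;
    }

    /** Coefficient of variation using the population standard deviation (assumption). */
    static double cv(int[] values) {
        double mean = 0.0;
        for (int v : values) {
            mean += v;
        }
        mean /= values.length;
        double sumSquares = 0.0;
        for (int v : values) {
            sumSquares += (v - mean) * (v - mean);
        }
        return Math.sqrt(sumSquares / values.length) / mean;
    }

    /** The five test lists from these notes, in the order they are listed. */
    static Map<String, int[]> testCases() {
        Map<String, int[]> cases = new LinkedHashMap<>();
        cases.put("random", new int[] {65535, 861, 65535, 861, 65535, 556, 65535, 956, 255, 1, 1, 1, 255});
        cases.put("random_nhl", new int[] {235, 861, 235, 861, 235, 556, 80, 956, 255, 42000, 42000, 42000, 255});
        cases.put("all_high", new int[] {65535, 65535, 65535, 65535, 65535, 65535, 65535, 65535, 65535, 65535, 65535, 65535, 65535});
        cases.put("all_low", new int[] {1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1});
        cases.put("all_different", new int[] {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 65535});
        return cases;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, int[]> e : testCases().entrySet()) {
            System.out.printf("%s entropy=%.4f cv=%.4f%n",
                    e.getKey(), entropy(e.getValue()), cv(e.getValue()));
        }
    }
}
```

Under these assumptions all_high and all_low should both give entropy 0 and CV 0, and all_different should give entropy log2(13), about 3.7004. Those three are the easy sanity checks; the two random lists need to be compared against the spreadsheet values.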

These entropy test cases with their entropy and cv values can be found here "F:\kurt\storage\CIM Research Folder\DR\2013\5-29-13\entropy\entropy test cases 5-29-13.xlsx"

I'll also test a real list from a gpr with 10,000 values. The original name of the gpr file was 1009951_bot_N-19(152)_08132012.gpr, which came from the 2012 good gprs diseases 1-8 folder. I renamed the gpr to "F:\kurt\storage\CIM Research Folder\DR\2013\5-29-13\entropy\test.gpr"

6-11-13

Now that I have my test data ready to go, I can rewrite the code. I should make sure I have the previous code copied to a safe place.

see also how to calculate information entropy in excel

6-13-13

Alright, I've basically refactored the code. The code can be found here: "F:\kurt\storage\CIM Research Folder\DR\2013\6-13-13\entropy program\entropy of array src 6-13-13.zip"

Now I can go through and look at my test cases and fix errors.

I copied the test case code to here S:\Research\Cancer_Eradication\Discovering tumor specific antigens\entropy\6-13-13 from "S:\Research\Cancer_Eradication\Discovering tumor specific antigens\entropy\5-29-13\entropy"

Now I can test out the code.

I tested the code and fixed some mistakes. I also measured the time it took on different systems. Here's a message I sent to Lu Wang to test on his computer as well.

message to Lu Wang

{

Hi Lu,

I'm sharing these two files with you. Maybe you could help me see how fast my program runs on your system. So far I have run the program on two systems.

time taken to run a test gpr with Pentium 4 CPU 3.4 GHz 2 GB RAM system (your old computer now at the far north wall of our lab)

2013/06/15 19:06:22

2013/06/15 19:14:55

8m33s

time taken to run a test gpr with AMD Phenom II X6 1055T CPU 2.8 GHz 8 GB RAM system (my personal computer at my apartment)

2013/06/15 19:49:00

2013/06/15 19:51:31

2m31s

Let's see how long your system will take. This will just take a little bit of your time. Here are the instructions.

-Start Eclipse. File->New Java project. Enter project name as EntropyOfArray. Navigate to the src folder for the project on your hard drive and paste the src files there.

-Right click on the src folder under EntropyOfArray in Java and click Refresh and now all of the src files should show up.

-Place the test.gpr somewhere on your hard drive.

-Open the Test_Immunosignature_Data_030413 class and change the String directory line so that it lists the directory on your hard drive that contains the test.gpr file. Make sure the filepath string is surrounded by quotes "" and that every backslash \ is actually two backslashes \\.

-Click the green arrow at the top of the Eclipse IDE editor to run the program. Program should run for several minutes.

-Copy the text output of the program in the console which states the time the program took and send it to me. If you could send me the name of your processor, the number of GHz, and the amount of RAM that would be great too.

-Go to the folder titled entropy inside of the folder that you put test.gpr into. Open test_details.txt and send me the number that is listed after Entropy of Distribution:

Thanks a lot for helping me out! I hope this doesn't take too much of your time. Let me know if you have any questions at all.

Best,

Kurt

}

Here are Lu's results and the specs of his system

Hi Kurt,

2013/06/15 20:18:29

2013/06/15 20:19:22

53s;

Entropy of Distribution: 6.2099199905835425

the processor of my computer is i7-3770, with 3.9GHz, the RAM the program took is about 2.5G. My computer has 32G of RAM and usually there are 16GB RAM free

Best,

Lu
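The start time, end time, and elapsed time reported in these runs could be produced with something like the sketch below. It assumes Java 8 or later for java.time; the class RunTimer and the method elapsed are hypothetical names, not taken from the actual program, and the timed work is left as a placeholder comment.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

/** Sketch: print start/end timestamps and elapsed time for a run (hypothetical helper). */
final class RunTimer {

    // Same layout as the timestamps in these notes, e.g. 2013/06/15 19:06:22
    static final DateTimeFormatter STAMP = DateTimeFormatter.ofPattern("yyyy/MM/dd HH:mm:ss");

    /** Formats the time between two timestamps as e.g. "8m33s", or "53s" under a minute. */
    static String elapsed(String start, String end) {
        Duration d = Duration.between(
                LocalDateTime.parse(start, STAMP),
                LocalDateTime.parse(end, STAMP));
        long minutes = d.getSeconds() / 60;
        long seconds = d.getSeconds() % 60;
        return minutes > 0 ? minutes + "m" + seconds + "s" : seconds + "s";
    }

    public static void main(String[] args) {
        String start = LocalDateTime.now().format(STAMP);
        System.out.println(start);

        // ... the entropy calculation on test.gpr would run here ...

        String end = LocalDateTime.now().format(STAMP);
        System.out.println(end);
        System.out.println(elapsed(start, end));
    }
}
```

Checking it against the three runs above: elapsed("2013/06/15 19:06:22", "2013/06/15 19:14:55") gives 8m33s, the 19:49:00 to 19:51:31 run gives 2m31s, and Lu's 20:18:29 to 20:19:22 run gives 53s.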