
This is my first post so please go easy on me ;D

For some research that I am involved in, we have generated two area measurements for a spinal cord section. The smaller measurement refers to a cavity formed by injury, and the larger area is the entire spinal cord. These measurements were made in Photoshop and exported with the same document name, but clearly different values.
For example,

              Label                               Document         Area
1827 Measurement 39 T7-B9_TileScan_005_Merging001_ch00.tif    92,041.52
1831 Measurement 40 T7-B9_TileScan_005_Merging001_ch00.tif 3,952,865.00

This is actually a simplified version that I created using R’s subset function to remove data. I have to do this because the range of scar areas overlaps the range of total cord areas, meaning they can’t be filtered with a simple size exclusion.

My example data set can be found here.
To generate this, please follow my [EDITED] work here.

Scar.Ablation.Data <- read.csv("/Scar Ablation Data.csv", stringsAsFactors=F)

Adding stringsAsFactors=F corrected an error generated later on.

test1 <- subset(Scar.Ablation.Data, Count != "", select = c(Label,Document,Area))

Removes all rows that have no Count value. When Photoshop exported the data, it included redundant measurements; however, all of these redundant rows contained no Count value, so they can be removed this way. The proposed alternative method did not work because R did not read the empty cells in the Count column in as NA.
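To illustrate (a quick sketch, assuming the alternative was an is.na()-based filter): the empty Count cells come in from this file as "" rather than NA.

is.na("")   # FALSE - empty strings are not NA, so an is.na() filter keeps these rows
"" == ""    # TRUE  - so subset(..., Count != "") correctly drops the empty-Count rows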

fileList = split(test1,test1$Document)

Generates a list where measurements are separated by Document name.
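For example (a sketch; the element name comes from the sample data above):

names(fileList)   # one element per Document name
fileList[["T7-B9_TileScan_005_Merging001_ch00.tif"]]   # the two rows shown earlier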

spineAreas = lapply(fileList, function(x) x[which(x$Area==max(x$Area)), ])

Takes each list (representing all the data for a given file name) and then finds and returns the data in the row with the largest area for each file.

scarAreas = lapply(fileList, function(x) x[which(x$Area==min(x$Area)), ])

We want the data from all rows whose area is less than the largest area, for each file. lapply returns a list, so now we want to turn them back into data frames.

spineData = do.call(rbind,spineAreas)
scarData = do.call(rbind,scarAreas)
row.names(spineData)=NULL #clean up row names
row.names(scarData)=NULL
write.csv(scarData, "/scarData.csv")
write.csv(spineData, "/spineData.csv")

When comparing my exports, the following problems arose:

  1. spineData contained Null values, but scarData did not.

This was resolved by switching x$Area<max to x$Area==min in the scarAreas function. The output, while still incorrect, did not change with this modification.

  2. The comparison between Areas does not always work. For example, for sample “C1-B3_TileScan_002_Merging001_ch00.tif”, the scar reported a higher area than the cord.

I tried a different method of comparison using the aggregate() function, but this returned data that was exactly the same as the data generated with the above method. However R is calculating these comparisons, it believes it is making the correct decision. This may indicate that there is some sort of formatting or import problem with my numeric Area values.

spineAreas2 = aggregate(Area ~ Document, data = test1, max)
scarAreas2 = aggregate(Area ~ Document, data = test1, min)

spineData2 = do.call(rbind,spineAreas2)
scarData2 = do.call(rbind,scarAreas2)

row.names(spineData2)=NULL #clean up row names
row.names(scarData2)=NULL #clean up row names


write.csv(scarData2, "C:/scarData2.csv")
write.csv(spineData2, "C:/spineData2.csv")
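A quick way to test this suspicion (a sketch, using the test1 data frame built above):

class(test1$Area)            # "character" here would explain the odd comparisons
as.numeric("3,952,865.00")   # NA with a warning - the commas block a numeric cast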

I am fine with swapping Null for 0 or NA, and I may try to do this in order to solve this problem. Thank you @Cole for your continued help through this problem, it is greatly appreciated.

2 Answers


  1. Ok, so if I understand you correctly, you want to a) clean the data (which you have already done), then b) divide the data by file name (also already done), then finally c) compare area measurements within each file: the smaller ones are the scars, the largest one is the spinal column. You want to sort each into an individual list, one for scar data, the other for spinal column data (the problem).

    To do this we are going to use the lapply function. It takes each element of a list (or vector) and applies a function to it. Here we write our own function: it takes each list element (representing all the data for a given file name) and finds and returns the row with the largest area for that file.

    spineAreas = lapply(fileList, function(x) x[which(x$Area==max(x$Area)), ])
    

    Next we do the same thing, but this time we want the smaller areas for the scars. Thus we want the data from all rows whose area is less than the largest area, for each file. This approach assumes that the largest area for each file is the spinal cord cross-section, and all other areas represent scars.

    scarAreas = lapply(fileList, function(x) x[which(x$Area<max(x$Area)), ])
    

    lapply returns a list, so now we want to turn the results back into data frames.

    spineData = do.call(rbind,spineAreas)
    scarData = do.call(rbind,scarAreas)
    #clean up row names
    row.names(spineData)=NULL
    row.names(scarData)=NULL
    

    The above approach will turn each string into a factor in your data frame. If you don’t want them as factors (they occasionally cause problems, as they don’t play nice with some functions) then you can do the following.

    do.call(rbind, lapply(spineAreas, data.frame, stringsAsFactors=FALSE))
    do.call(rbind, lapply(scarAreas, data.frame, stringsAsFactors=FALSE))
    #Then clean up row names as in first example, or pass row.names=F 
    #when writing to a .csv file
    

    Let me know if this is what you were trying to accomplish.

  2. Summary of the problem

    Now that I have a sample data set, I can see a few problems.

    The first problem is that you do not have a .csv file. csv stands for comma-separated values and, as you can see, your file does not contain commas between values. It looks like a tsv, or tab-separated values, file. In R you want to read this in using the read.delim() function as follows:

    ablationData = read.delim("Scar Ablation Data.txt",stringsAsFactors=F)
    

    (you may also want to consider naming your data with a .tsv extension if it is indeed tab separated)

    After reading in the data it is apparent that

    1. For ‘bad’ reads, the file contains “Null”, which is different from the NULL object in R (notice the all caps). Using x=="Null" is the correct way to test for these (as you were doing before).
    2. Reads with no Count data are represented by "" values. I’m guessing this is because there is nothing between the tabs in the .tsv file, so the empty field is read in as "". Note that if you were to use a different file format, such as .csv, the "" would be read in as NA instead. This comes down to how the R read.xxx functions handle different file types and is a good thing to keep in mind for the future.
    3. The Count column represents the number of ‘features’ per measurement. It appears that each measurement has a measurement # row that is an aggregate summary of that measurement. Then each feature of the measurement has its own row, represented by measurement #-Feature #. Based on your description of the problem, you want to remove the individual ‘feature’ measurements and compare only the aggregate values for each measurement set. I’m not sure if this is what you actually intend to do, so I would think carefully about why you are removing the individual feature rows, because they are certainly NOT duplicate/redundant values as you stated above.
    4. As mentioned above, we have "" or "Null" values in many of our columns that otherwise contain numeric input. This will cause all of the values in those columns to be cast as character type instead of numeric. This is why the sorting from before was not working: max() works very differently on characters than on numerics (see the short demonstration after this list). After removing the offending "" and "Null" values we will have to cast our desired columns to numeric data types.
    5. Another problem with the data is that it contains both , and . in its numbers. R does not like ,’s in its numbers and will not know how to interpret them. Thus we will need to remove them (also covered in the demonstration below).
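    A short demonstration of points 4 and 5, using the two Area strings from the sample data in the question:

    max(c("92,041.52", "3,952,865.00"))                              # "92,041.52": lexicographic order, "9" sorts after "3"
    max(as.numeric(gsub(",", "", c("92,041.52", "3,952,865.00"))))   # 3952865: the intended numeric comparison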

    In Summary:

    • Read in data (as .tsv)
    • Separate out all "Null" values (see note below)
    • Remove all individual feature measurements, keeping only the aggregate data for each measurement set.
    • Remove all , from numbers.
    • Cast columns containing only numbers to numeric
    • Separate the data by file name
    • Process each file
      • Find the aggregate measurement with the largest Area. This represents the spinal column
      • All other measurement values represent scars.
      • Separate the results into two different data sets. One for scars, one for spinal columns.
    • Add the “Null” values back in (see note below)

    A Question: Are you sure you want to separate based on file and then compare only aggregate measurements, or do you really want to separate based on measurement and then compare each feature within that measurement?

    Note on previous answer

    The spineData should have been the only list to contain "Null" values. This is because the max() and min() of a data set consisting entirely of "Null" is simply "Null". Thus == max(data) will be true for each "Null" data point (i.e. "Null" == "Null") but < max(data) will be false for each "Null" data point (i.e. "Null" < "Null"). I really don’t think you want to use ==min(data), because then you are going to throw out all intermediate values (presumably valid scar measurements) for each file where you have non-“Null” data.
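    A tiny illustration with made-up values:

    x = c("Null", "Null")
    x[x == max(x)]   # both rows kept: "Null" == "Null" is TRUE
    x[x < max(x)]    # nothing kept: "Null" < "Null" is FALSE

    y = c("100", "250", "900")
    y[y == min(y)]   # only "100": the intermediate value "250" would be thrown out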

    If you really want to keep the "Null" reads in your data set, I would recommend pulling them out, processing the rest of the data, and then adding them back in at the end.

    Solution

    Read in data.

     data = read.delim("Scar Ablation Data.tsv",stringsAsFactors=F)
    

    Separate out "Null" measurements

    data2 = data[-which(data$Area=="Null"),]
    

    Remove all feature measurements, keeping only aggregate data for each measurement. Keep only Label, Document, and Area columns.

    data2 = data2[-which(data2$Count==""),c("Label","Document","Area")]
    

    For desired columns containing numeric data, remove , from numbers and cast to type numeric.

    data2$Area = as.numeric(gsub(",","",data2$Area))
    

    Separate data by file/Document name.

    fileList = split(data2,data2$Document)
    

    Process each file. The largest Area value represents the spinal column, all other (smaller) values represent scars. Each of these statements returns a list with our desired results.

    spineAreas = lapply(fileList, function(x) x[which(x$Area==max(x$Area)), ])
    scarAreas = lapply(fileList, function(x) x[which(x$Area<max(x$Area)), ])
    

    Convert back to data frames. Here I have added an extra step to avoid our data being converted to factors.

    spineAreas = do.call(rbind, lapply(spineAreas, data.frame, stringsAsFactors=FALSE))
    scarAreas = do.call(rbind, lapply(scarAreas, data.frame, stringsAsFactors=FALSE))
    

    Add files with "Null" reads back in and clean up row names. Do this only when you are completely done analyzing the data.

    nullDocs = match(unique(data$Document[data$Label=="Null"]),data$Document)
    nullDocs = data.frame(data[nullDocs,c("Label","Document","Area")],stringsAsFactors=F)
    scarAreas = rbind(nullDocs,scarAreas)
    spineAreas = rbind(nullDocs,spineAreas)
    row.names(scarAreas)=NULL
    row.names(spineAreas)=NULL
    

    Note Well

    By adding the "Null" values back in, our Area column will be forced back to the character type, since each element in a column must be of the same data type. This is important because it means that you cannot really do any meaningful numeric operations on your data in R.

    For example: spineAreas$Area>scarAreas$Area will return

    [1] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE  TRUE
    [23]  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE  TRUE
    

    Which might lead us to believe that we did not sort our data correctly.

    However: as.numeric(spineAreas$Area)>as.numeric(scarAreas$Area) will return

    [1]   NA   NA   NA TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
    [28] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
    

    This indicates that the first 3 values were strings (in this case "Null") that were replaced by NA, and that the rest of our data is correctly sorted.

    So either add the "Null" values back in when you are completely done with data analysis, or recast your desired columns to numeric (e.g. spineAreas$Area = as.numeric(spineAreas$Area)).

    If you want to avoid this messy typing business altogether (preferred)

    Read in your data so that all "" and "Null" are represented by NA. This will make life a lot easier, but will not save you from having to remove the , and cast your data to numeric.

    Here are the lines you would need to change

    data = read.delim("Scar Ablation Data.tsv",na.strings=c("NA","Null",""),stringsAsFactors=F)
    data2 = data[-which(is.na(data$Area)),]
    data2 = data2[-which(is.na(data2$Count)),c("Label","Document","Area")]
    nullDocs = match(unique(data$Document[is.na(data$Label)]),data$Document)
    

    This will keep your data as numeric even after adding back the null reads and is probably the preferred way to do things.
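    As a quick check (a sketch, assuming the rest of the pipeline above is re-run with these changes), the Area columns now stay numeric even after the NA rows are added back:

    class(spineAreas$Area)             # "numeric"
    spineAreas$Area > scarAreas$Area   # NA for the "Null" files, TRUE elsewhere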
