This is my first post so please go easy on me ;D
For some research I am involved in, we have generated two area measurements for each spinal cord section. The smaller measurement refers to a cavity formed by injury, and the larger one is the entire spinal cord. These measurements were made in Photoshop and exported under the same document name, but with clearly different values.
For example,
      Label           Document                                 Area
1827  Measurement 39  T7-B9_TileScan_005_Merging001_ch00.tif   92,041.52
1831  Measurement 40  T7-B9_TileScan_005_Merging001_ch00.tif   3,952,865.00
This is actually a simplified version that I created using R's subset function to remove data. I have to do this because the range of scar areas overlaps the range of total cord areas, meaning they can't be filtered with a simple size exclusion.
My example data set can be found here.
To generate this, please follow my [EDITED] work here.
Scar.Ablation.Data <- read.csv("/Scar Ablation Data.csv", stringsAsFactors=F)
Adding stringsAsFactors=F corrected an error generated later on.
test1 <- subset(Scar.Ablation.Data, Count != "", select = c(Label,Document,Area))
Removes all rows that have no Count value. When Photoshop exported the data, it included redundant measurements; however, all of these redundant measurements lacked a Count value, so they can be removed this way. The proposed alternative method did not work because R did not read the empty entries in the Count column in as NA.
fileList = split(test1,test1$Document)
Generates a list where measurements are separated by Document name.
spineAreas = lapply(fileList, function(x) x[which(x$Area==max(x$Area)), ])
Takes each list (representing all the data for a given file name) and then finds and returns the data in the row with the largest area for each file.
scarAreas = lapply(fileList, function(x) x[which(x$Area==min(x$Area)), ])
We want the data from all rows whose area is less than the largest area, for each file (here implemented with min(), which keeps only the smallest row; see the problems below). lapply returns a list, so now we want to turn these back into data frames:
spineData = do.call(rbind,spineAreas)
scarData = do.call(rbind,scarAreas)
row.names(spineData)=NULL #clean up row names
row.names(scarData)=NULL
write.csv(scarData, "/scarData.csv")
write.csv(spineData, "/spineData.csv")
When comparing my exports, the following problems arose:
- spineData contained Null values, but scarData did not. This was resolved by switching x$Area < max to x$Area == min in the scarAreas function. The output, while still incorrect, did not change with this modification.
- The comparison between Areas does not always work. For example, for sample "C1-B3_TileScan_002_Merging001_ch00.tif", the scar reported a higher area than the cord.
I tried a different method of comparison using the aggregate()
function, but this returned exactly the same data as the method above. However R is performing these comparisons, it believes it is making the correct decision, which may indicate some sort of formatting or import problem with my numerical Area values.
spineAreas2 = aggregate(Area ~ Document, data = test1, max)
scarAreas2 = aggregate(Area ~ Document, data = test1, min)
spineData2 = do.call(rbind,spineAreas2) # note: aggregate() already returns a data frame; do.call(rbind, ...) here rbinds its columns into a matrix
scarData2 = do.call(rbind,scarAreas2)
row.names(spineData2)=NULL #clean up row names
row.names(scarData2)=NULL #clean up row names
spineData = do.call(rbind, lapply(spineAreas, data.frame, stringsAsFactors=FALSE))
scarData = do.call(rbind, lapply(scarAreas, data.frame, stringsAsFactors=FALSE))
#Then clean up row names as in first example, or pass row.names=F
#when writing to a .csv file
write.csv(scarData2, "C:/scarData2.csv")
write.csv(spineData2, "C:/spineData2.csv")
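To illustrate the suspected import problem, here is a minimal check using the two Area values from my example above (assuming Area was read in as text because of the commas):

"92,041.52" > "3,952,865.00"   # TRUE -- character comparison goes symbol by symbol, and "9" sorts after "3"
as.numeric("92,041.52")        # NA, with a warning -- the comma blocks the cast to numeric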
I am fine with swapping Null for 0 or NA, and I may try to do this in order to solve this problem. Thank you @Cole for your continued help with this problem; it is greatly appreciated.
Answers
Ok, so if I understand you correctly, you want to a) clean the data (which you have already done), then b) divide the data by file name (also already done), then finally c) compare area measurements within each file: the smaller ones are the scars, and the largest one is the spinal column. You want to sort these into individual lists, one for scar data and the other for spinal column data (the problem).
To do this we are going to use the lapply function. It takes each element of a list (or vector) and applies a function to it. Here we write our own function: it takes each list element (representing all the data for a given file name) and finds and returns the row with the largest area for that file.
Next we do the same thing, but this time we want the smaller areas for the scars. Thus we want the data from all rows whose area is less than the largest area, for each file. This approach assumes that the largest area for each file is the spinal cord cross-section, and all other areas represent scars. (A sketch of both steps follows.)
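A minimal sketch of both steps, reusing the fileList split from the question (the scar line keeps every row below the maximum, matching the description above):

# For each file, the row with the largest Area is the spinal column
spineAreas = lapply(fileList, function(x) x[which(x$Area == max(x$Area)), ])
# For each file, every row below the maximum Area is a scar
scarAreas  = lapply(fileList, function(x) x[which(x$Area <  max(x$Area)), ])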
lapply returns a list, so now we want to turn the results back into data frames. The simple do.call(rbind, ...) approach will turn each string into a factor in your data frame. If you don't want them as factors (they occasionally cause problems, as they don't play nice with some functions), then you can do the following (see the sketch below). Let me know if this is what you were trying to accomplish.
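A minimal sketch of both conversions, matching the lines quoted earlier in the question:

# Simple conversion -- strings become factors
spineData = do.call(rbind, spineAreas)
scarData  = do.call(rbind, scarAreas)

# Factor-free conversion -- rebuild each list element with stringsAsFactors=FALSE
spineData = do.call(rbind, lapply(spineAreas, data.frame, stringsAsFactors = FALSE))
scarData  = do.call(rbind, lapply(scarAreas,  data.frame, stringsAsFactors = FALSE))

# Then clean up row names, or pass row.names=FALSE when writing to a .csv file
row.names(spineData) = NULL
row.names(scarData)  = NULL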
Summary of the problem
Now that I have a sample data set, I can see a few problems.
The first problem is that you do not have a .csv file. .csv stands for comma separated values, and as you can see, your file does not contain commas between values. It looks like it is a .tsv, or tab separated values, file. In R, you want to read this in using the read.delim() function, as in the sketch below. (You may also want to consider naming your data with a .tsv extension if it is indeed tab separated.)
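A sketch of that read, keeping the question's file name (adjust the path as needed):

# read.delim() uses sep = "\t" by default, matching a tab separated file
Scar.Ablation.Data <- read.delim("/Scar Ablation Data.csv", stringsAsFactors = FALSE)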
After reading in the data, it is apparent that:

- The "Null" entries are strings, NOT the NULL object in R (notice all caps). Using x=="Null" is the correct way to test for these (as you were doing before).
- Count data are represented by "" values. I'm guessing this has to do with the nature of there being no values present in the .tsv file being represented as "" since there is nothing between the tabs. Note that if you were to use a different file format, such as .csv, the "" would be read in as NA instead. This comes down to how the R read.xxx functions handle different file types and is a good thing to keep in mind for the future.
- The Count column represents the number of 'features' per measurement. It appears that each measurement has a measurement # row that is an aggregate summary of that measurement. Then each feature of the measurement has its own row, represented by measurement #-Feature #. Based on your description of the problem, you want to remove the individual 'feature' measurements and compare only the aggregate values for each measurement set. I'm not sure if this is what you are actually intending/want to do, so I would think carefully about why you are removing the individual feature rows, because they are certainly NOT duplicate/redundant values as you stated they were above.
- We have "" or "Null" values in many of our columns that otherwise contain numeric input. This will cause all of the values in those columns to be cast as character type instead of numeric. This is why the sorting from before was not working: max() works very differently on characters as opposed to numerics (a short sketch follows this list). After removing the offending "" and "Null" values we will have to cast our desired columns to numeric data types.
- Your data uses both , and . in its numbers. R does not like ,'s in its numbers and will not know how to interpret them. Thus, we will need to remove them.
In Summary:

1. Read in your data as a tab separated file (.tsv).
2. Remove the "Null" values (see note below).
3. Remove the individual feature measurements, keeping only the aggregate data for each measurement set.
4. Remove the , from numbers.
5. Cast the desired columns to numeric.
6. Find the measurement with the largest Area. This represents the spinal column; all other measurement values represent scars.

A Question: Are you sure you want to separate based on file and then compare only aggregate measurements, or do you really want to separate based on measurement and then compare each feature within that measurement?
Note on previous answer
The spineData should have been the only list to contain "Null" values. This is because the max() and min() of a data set consisting entirely of "Null" is simply "Null". Thus == max(data) will be true for each "Null" data point (ie. "Null" == "Null"), but < max(data) will be false for each "Null" data point (ie. "Null" < "Null"). I really don't think you want to use == min(data), because then you are going to throw out all intermediate values (presumably valid scar measurements) for each file where you have non-"Null" data.

If you really want to keep the "Null" reads in your data set, I would recommend pulling them out, processing the rest of the data, and then adding them back in at the end.
Solution

1. Read in the data.
2. Separate out the "Null" measurements.
3. Remove all feature measurements, keeping only the aggregate data for each measurement. Keep only the Label, Document, and Area columns.
4. For the desired columns containing numeric data, remove the , from the numbers and cast to type numeric.
5. Separate the data by file/Document name.
6. Process each file. The largest Area value represents the spinal column; all other (smaller) values represent scars. Each of these statements returns a list with our desired results.
7. Convert back to data frames. Here I have added an extra step to avoid our data being converted to factors.
8. Add the files with "Null" reads back in and clean up the row names. Do this only when you are completely done analyzing the data.

A consolidated sketch of these steps follows.
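Putting the steps together, a minimal sketch (assuming the file really is tab separated, that aggregate rows are the ones with a non-empty Count, that the "Null" reads sit in the Area column, and that commas are thousands separators; the file name is a placeholder):

# 1. Read in the data; read.delim() defaults to tab separated input
dat <- read.delim("Scar Ablation Data.csv", stringsAsFactors = FALSE)

# 2./3. Keep only aggregate rows and the needed columns, then set the
#       "Null" reads aside until the very end
dat      <- subset(dat, Count != "", select = c(Label, Document, Area)) # use !is.na(Count) if blanks come in as NA
nullData <- subset(dat, Area == "Null")
dat      <- subset(dat, Area != "Null")

# 4. Strip the thousands commas and cast Area to numeric
dat$Area <- as.numeric(gsub(",", "", dat$Area))

# 5. Split the data by file/Document name
fileList <- split(dat, dat$Document)

# 6. Largest Area per file is the spinal column; smaller ones are scars
spineAreas <- lapply(fileList, function(x) x[which(x$Area == max(x$Area)), ])
scarAreas  <- lapply(fileList, function(x) x[which(x$Area <  max(x$Area)), ])

# 7. Back to data frames, avoiding factors
spineData <- do.call(rbind, lapply(spineAreas, data.frame, stringsAsFactors = FALSE))
scarData  <- do.call(rbind, lapply(scarAreas,  data.frame, stringsAsFactors = FALSE))

# 8. Only when completely done analyzing: add the "Null" reads back in
#    (this forces Area back to character; see Note Well) and tidy row names
spineData <- rbind(spineData, nullData)
row.names(spineData) <- NULL
row.names(scarData)  <- NULL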
Note Well

By adding the "Null" values back in, our Area column will be forced back to the character type, since each element in a column must be of the same data type. This is important because it means that you cannot really do any meaningful operations in R on your data. For example:

spineAreas$Area > scarAreas$Area

will return a logical vector computed by character comparison, which might lead us to believe that we did not sort our data correctly. However:

as.numeric(spineAreas$Area) > as.numeric(scarAreas$Area)

will return NA for the first 3 values and TRUE for the rest. This indicates that the first 3 values were strings (in this case "Null"), which were replaced by NA, and then indicates that our data is correctly sorted.

So either add the "Null" values back when you are completely done with data analysis, or recast your desired columns to numerics (eg. spineAreas$Area = as.numeric(spineAreas$Area)).
If you want to avoid this messy typing business altogether (preferred)

Read in your data so that all "" and "Null" entries are represented by NA. This will make life a lot easier, but will not save you from having to remove the , and cast your data to numeric. Here are the lines you would need to change (see the sketch below).
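A minimal sketch of that change, using read.delim's na.strings argument (the file name is the placeholder from above):

# Treat both empty fields and literal "Null" as NA on the way in
dat <- read.delim("Scar Ablation Data.csv",
                  stringsAsFactors = FALSE,
                  na.strings = c("", "Null"))

# The Count filter then becomes an NA test instead of a "" test
dat <- subset(dat, !is.na(Count), select = c(Label, Document, Area))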
This will keep your data as numeric even after adding back the null reads and is probably the preferred way to do things.