[SOLVED] After custom function, calling the object in R console produces desired result whilst "View" the object from environment doesn't

Issue

I wrote a function that uses aggregate() to both sum the values of a specific column and count the number of rows in that column, grouped by the values of an adjacent column (in this case numbers between 6 and 12). The column to sum and count is called Count, and the column to group by is called CharLen.

Two tiny data frames, then placed in a list:

# Test df1
{
  Seq1 <- as.character(rep(c("AAA", "BBB", "CCC"), times = 4))
  Count1 <- rep(c(12, 56, 3), times = 4)
  CharLen1 <- c(6, 6, 6, 7, 7, 7, 9, 11, 12, 8, 10, 9)
  Testdf1 <- data.frame(Seq1, Count1, CharLen1)
  colnames(Testdf1) <- c("Seq", "Count", "CharLen")
  rm(Seq1, Count1, CharLen1)
}

# Test df2
{
  Seq2 <- as.character(c("DDD", "EEE", "FFF", "AAA", "BBB", "GGG", "AAA", "BBB", "CCC", "AAA", "BBB", "CCC"))
  Count2 <- rep(c(7, 3, 15), times = 4)
  CharLen2 <- c(8, 6, 8, 7, 12, 12, 12, 11, 12, 8, 10, 9)
  Testdf2 <- data.frame(Seq2, Count2, CharLen2)
  colnames(Testdf2) <- c("Seq", "Count", "CharLen")
  rm(Seq2, Count2, CharLen2)
}


# List these dataframes together  
  List_of_dfs <- lapply(ls(pattern="Testdf[0-9]+"), function(x) get(x))

I wrote this into a function so that I can pass it a list (List_of_dfs) containing a large number of large data frames with differing numbers of rows.
(The data frames always have the same number of columns, the same column names, and the same value types.)

Function

SumCountFunction <- function(i) {
    aggregate(Count ~ CharLen, data=i, FUN = function(x) c(Sum=sum(x), 
    Count=length(x)))
}

lapply the function to list of dfs

SummayCountOut <- lapply(List_of_dfs, SumCountFunction)

Once done I extract this to a single Summary Df

SummaryDf <- do.call("rbind", SummayCountOut)

Then add a numerical ID corresponding to the original dataframe position within the original List_of_dfs

SummaryDf[["SampleNumber"]] <- rep(seq_along(SummayCountOut), sapply(SummayCountOut, nrow))
    

My question and confusion is this:

  • When I generate "SummayCountOut" the console correctly shows two new columns of data: "Count.Sum" & "Count.Count".
  • When I convert to the single large summary dataframe "SummaryDf" this also shows correct data.
  • But when I View(SummaryDf) instead of calling SummaryDf direct, the two new columns I need have disappeared.

From what I can find this is due to the object only residing while the function is called? I tried using "return" as found in another SO thread but this didn’t retain the new columns, and the only other thing I found was "<<-" which others here have stated is inherently evil.

Originally I was piping in dplyr using group_by and summary functions. I couldn’t get dplyr code into a function though (I think due to NSE or lazy eval?), hence wishing to use base R instead.

Solution

Basically, your SumCountFunction produces a data frame with an embedded two-column matrix, not a flat data frame. You can see this with a str() call, where Count is a matrix of 14 rows and 2 columns:

str(SummaryDf)

# 'data.frame': 14 obs. of  2 variables:
#  $ CharLen: num  6 7 8 9 10 11 12 6 7 8 ...
#  $ Count  : num [1:14, 1:2] 71 71 12 15 56 56 3 3 7 29 ...
#   ..- attr(*, "dimnames")=List of 2
#   .. ..$ : NULL
#   .. ..$ : chr [1:2] "Sum" "Count"

The challenge is that aggregate() applies a single FUN per call. When FUN returns c(Sum = ..., Count = ...), aggregate() stores both results together as one matrix column rather than as two separate columns.
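As a quick check of this, the sketch below rebuilds Testdf1 from the question and inspects the result of the original aggregate() call. It uses only base R; the object name agg is introduced here purely for illustration.

```r
# Rebuild Testdf1 from the question so the snippet runs on its own
Testdf1 <- data.frame(
  Seq     = rep(c("AAA", "BBB", "CCC"), times = 4),
  Count   = rep(c(12, 56, 3), times = 4),
  CharLen = c(6, 6, 6, 7, 7, 7, 9, 11, 12, 8, 10, 9)
)

# Original approach: FUN returns a named vector of length two
# (agg is an illustrative name, not from the original post)
agg <- aggregate(Count ~ CharLen, data = Testdf1,
                 FUN = function(x) c(Sum = sum(x), Count = length(x)))

ncol(agg)             # 2: the data frame has only two columns
names(agg)            # "CharLen" "Count"
is.matrix(agg$Count)  # TRUE: Count is a matrix column
dim(agg$Count)        # 7 2
colnames(agg$Count)   # "Sum" "Count"
```

Printing agg in the console expands the matrix column into Count.Sum and Count.Count, which would explain why the console output looks right even though the data frame itself holds only CharLen and Count.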

Consider merging two (or more) separate aggregate() calls and then renaming the columns, since both results would otherwise carry the original column name Count (merge() disambiguates them as Count.x and Count.y).

# TWO-DF MERGE
SumCountFunction <- function(i) {
  merge(aggregate(Count ~ CharLen, data=i, FUN = sum),
        aggregate(Count ~ CharLen, data=i, FUN = length),
        by = "CharLen")       
}

# CHAIN MERGE (ALTERNATIVE)
SumCountFunction <- function(i) {
  dfs <- lapply(c('sum', 'length'), function(f) aggregate(Count ~ CharLen, data=i, FUN = f))
  Reduce(function(x, y) merge(x, y, by = "CharLen"), dfs)
}

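# Re-run lapply with the revised function, then bind the results and rename.
# merge() puts the sum first (Count.x) and the count second (Count.y).
SummayCountOut <- lapply(List_of_dfs, SumCountFunction)
SummaryDf <- setNames(do.call("rbind", SummayCountOut), 
                      c("CharLen", "Count.Sum", "Count.Count"))
str(SummaryDf)

# 'data.frame': 14 obs. of  3 variables:
#  $ CharLen    : num  6 7 8 9 10 11 12 6 7 8 ...
#  $ Count.Sum  : num  71 71 12 15 56 56 3 3 7 29 ...
#  $ Count.Count: int  3 3 1 2 1 1 1 1 1 3 ...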
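An alternative that is not part of the answer above: if you would rather keep the single aggregate() call from the question, the matrix column can be flattened afterwards with do.call(data.frame, ...). This is a minimal sketch using only base R; the names SumCountMatrix and FlattenAggregate are hypothetical helpers introduced for illustration.

```r
# Rebuild Testdf1 from the question so the snippet runs on its own
Testdf1 <- data.frame(
  Seq     = rep(c("AAA", "BBB", "CCC"), times = 4),
  Count   = rep(c(12, 56, 3), times = 4),
  CharLen = c(6, 6, 6, 7, 7, 7, 9, 11, 12, 8, 10, 9)
)

# The question's original function (hypothetical name): returns a matrix column
SumCountMatrix <- function(i) {
  aggregate(Count ~ CharLen, data = i,
            FUN = function(x) c(Sum = sum(x), Count = length(x)))
}

# Hypothetical helper: data.frame() splits a matrix argument into
# one ordinary column per matrix column, named <arg>.<colname>
FlattenAggregate <- function(agg) do.call(data.frame, agg)

flat <- FlattenAggregate(SumCountMatrix(Testdf1))

names(flat)              # "CharLen" "Count.Sum" "Count.Count"
is.matrix(flat$Count.Sum)  # FALSE: now a plain numeric column
flat$Count.Sum           # 71 71 12 15 56 56  3
flat$Count.Count         # 3 3 1 2 1 1 1
```

The same helper could be applied inside lapply() over List_of_dfs before the rbind step, so that SummaryDf ends up with three flat columns without any renaming.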

Answered By – Parfait

Answer Checked By – Cary Denson (BugsFixing Admin)
