Recently, while working with a large-ish dataset (for my laptop) of a little over 600,000 records, I noticed that the rate at which data was loaded into a data frame degraded over time as I kept appending to a single ever-growing data frame. I watched the estimated time remaining creep from minutes, to tens of minutes, to almost an hour, and it kept climbing, much like the estimated time remaining of any action in Windows (the code for estimating time remaining is at the bottom of this page).
For background, this code reads strings from a data frame column, uses str_split() to break them into individual words, and then adds each word and its frequency in the given string (i.e. 1) to a data frame. (Don't judge me for the 1; in reality I'm multiplying by the number of hits for the given wiki page, which I've left out here, so that at the end I can group and sum the data to get total frequencies.)
library(stringr)   # str_split()
library(tibble)    # add_row()
library(magrittr)  # %>% and %<>%

str_freq_df <- data.frame(word = character(0), freq = numeric(0))

for (i in 1:nrow(df)) {
  # Split the string into individual words and record each word's frequency
  split_title <- str_split(df$data[i], " ")[[1]]
  for (j in 1:length(split_title)) {
    str_freq_df %<>%
      add_row(word = split_title[j], freq = 1 * df$count_views[i])
  }
}
Unfortunately, the performance of this loop gets slower and slower as more data is added to the data frame.
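To see why, here is a minimal synthetic sketch (not the original wiki data; the grow_by_row() helper is mine) that times the same row-at-a-time pattern at two sizes. Each append copies the entire data frame, so total work grows roughly quadratically with the row count.

```r
library(tibble)

# Hypothetical helper: grow a data frame one row at a time,
# mimicking the inner loop above
grow_by_row <- function(n) {
  df <- data.frame(word = character(0), freq = numeric(0))
  for (i in 1:n) {
    df <- add_row(df, word = "x", freq = 1)
  }
  df
}

t1 <- system.time(grow_by_row(1000))["elapsed"]
t2 <- system.time(grow_by_row(4000))["elapsed"]
# With quadratic behavior, 4x the rows costs closer to 16x the time
# than 4x; exact ratios vary by machine
print(c(t1, t2))
```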
I noticed that tibble::add_row() was quite fast in the beginning, so I experimented with adding rows to a temporary data frame, appending that temporary data frame to the master data frame every 10,000 rows, clearing the temporary one, and repeating until all of the data from the source data frame "df" had been parsed.
str_freq_df <- data.frame(word = character(0), freq = numeric(0))
df_tmp <- data.frame(word = character(0), freq = numeric(0))

for (i in 1:nrow(df)) {
  # Add temp dataset to master dataset every 10,000 rows.
  # After adding temp to master, clear the temp dataset.
  if (i %% 10000 == 0) {
    str_freq_df <- rbind(str_freq_df, df_tmp)
    df_tmp <- data.frame(word = character(0), freq = numeric(0))
  }
  # Split the string into individual words and record each word's frequency
  split_title <- str_split(df$data[i], " ")[[1]]
  for (j in 1:length(split_title)) {
    df_tmp %<>%
      add_row(word = split_title[j], freq = 1)
  }
}
# Flush any rows accumulated after the last full batch of 10,000
str_freq_df <- rbind(str_freq_df, df_tmp)
This proved to be significantly faster, with very little degradation in performance over time. In fact, the estimated time remaining actually trended downward as data was loaded, which was a huge relief. What was going to take several hours completed in 15 minutes.
And, as promised, here is the full code I was using to report progress, duration, and estimated time remaining as the data was processed (it's not pretty, but it works).
library(lubridate)  # now()

str_freq_df <- data.frame(word = character(0), freq = numeric(0))
df_tmp <- data.frame(word = character(0), freq = numeric(0))
start_time <- now()
dfrows <- nrow(df)

for (i in 1:dfrows) {
  # Add temp dataset to master dataset every 10,000 rows.
  # After adding temp to master, clear the temp dataset.
  if (i %% 10000 == 0) {
    str_freq_df <- rbind(str_freq_df, df_tmp)
    df_tmp <- data.frame(word = character(0), freq = numeric(0))
  }
  # Split the string into individual words and record each word's frequency
  split_title <- str_split(df$data[i], " ")[[1]]
  for (j in 1:length(split_title)) {
    df_tmp %<>%
      add_row(word = split_title[j], freq = 1)
  }
  # Output progress, duration, and estimated time remaining
  if (i %% 1000 == 0) {
    frac <- i / dfrows
    pct <- (frac * 100) %>% round(2)
    print(paste0("rows: ", i, "; pct: ", pct, "%; ", now()))
    curr_duration <- now() - start_time
    print("Time lapsed:")
    print(curr_duration)
    est_duration <- (dfrows * (1 - frac)) * (curr_duration / i)
    print("Estimated time remaining:")
    print(est_duration)
  }
}
# Flush any rows accumulated after the last full batch of 10,000
str_freq_df <- rbind(str_freq_df, df_tmp)
This returns something like the following as the processing progresses.
[1] "rows: 508000; pct: 83.61%; 2023-10-12 02:18:26.726105"
[1] "Time lapsed:"
Time difference of 13.13106 mins
[1] "Estimated time remaining:"
Time difference of 2.57498 mins
[1] "rows: 509000; pct: 83.77%; 2023-10-12 02:18:29.369841"
[1] "Time lapsed:"
Time difference of 13.17511 mins
[1] "Estimated time remaining:"
Time difference of 2.552658 mins
[1] "rows: 510000; pct: 83.93%; 2023-10-12 02:18:32.411559"
[1] "Time lapsed:"
Time difference of 13.22593 mins
[1] "Estimated time remaining:"
Time difference of 2.531547 mins
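As a sanity check, the estimate formula can be replayed against the first progress line above (values copied from that output; the total row count is inferred from i / frac):

```r
i <- 508000
elapsed_min <- 13.13106
frac <- 0.8361                      # 83.61% complete
dfrows <- i / frac                  # implied total: a little over 600,000 rows
# remaining rows * average time per row so far
est <- (dfrows * (1 - frac)) * (elapsed_min / i)
round(est, 2)                       # ~2.57 minutes, matching the reported estimate
```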