While comparing the efficiency of two functions in an answer to Check if list contains another list in R, I stumbled upon an interesting result. Sorting greatly increases the efficiency of
duplicated when the vector is large. This came as a surprise as I had never noticed a considerable difference in my own work using
duplicated. Indeed, for sizes that I work with everyday, there isn't a difference. Observe:
set.seed(1007) s1 <- sample(10^2, 10^3, replace = TRUE) s1_sort <- sort(s1) library(microbenchmark) microbenchmark(dp=duplicated(s1), dp_sort=duplicated(s1_sort), times=1000) Unit: microseconds expr min lq mean median uq max neval cld dp 16.459 16.9425 22.06371 17.2965 22.5050 1541.137 1000 a dp_sort 17.007 17.5005 25.54953 17.8200 23.3655 1549.198 1000 a
As you can see, there is no noticeable difference in timings when the vector is sorted. However, on very large vectors, the results are much different. Observe:
s2 <- sample(10^6, 10^7, replace = TRUE) s2_sort <- sort(s2) microbenchmark(dp=duplicated(s2), dp_sort=duplicated(s2_sort), times=100) Unit: milliseconds expr min lq mean median uq max neval cld dp 816.6883 847.9231 869.6829 861.8210 882.3978 1019.6339 100 b dp_sort 287.6779 305.4779 322.8830 315.1198 324.9249 449.1734 100 a
Almost 3x faster!!! This led me down the rabbit hole, which began here: r-source.../duplicated.R. From here we see that duplicated makes a call to
.Internal(duplicated(x,...)). Then using the function
pryr::show_c_source(.Internal(duplicated(x))) and the workaround suggested by @joran (
show_c_source is currently giving issues.. see Is 'show_c_source()' borken?), we see that
duplicated makes a call to do_duplicated. Finally, the heart of
duplicated is revealed (It starts at line 667 and ends at 988). It appears that the entire vector is looped over and then some hashing occurs:
724 /* count unique entries */ 725 k = 0; 726 for (i = 0; i < n; i++) 727 if (LOGICAL(dup)[i] == 0) 728 k++; 776 /* Build a hash table, ignoring information on duplication */ 777 static void DoHashing(SEXP table, HashData *d)
I don't fully understand all of the code, but it seems like sorting shouldn't matter. We loop over the entire vector in either case (sorted vs. non-sorted) and ultimately call an assortment of hash functions, which shouldn't depend on whether a vector is sorted or not. My initial thought was that some sort of branch prediction was going on (see this question), but from the update to this answer, it seems that these things shouldn't matter any more.
What's going on??
The gap seems to increase as both the size of the vector and the number of duplicates increases.
set.seed(496) s3 <- sample(10^6, 10^8, replace = TRUE) s3_sort <- sort(s3) microbenchmark(dp=duplicated(s3), dp_sort=duplicated(s3_sort), times = 10) Unit: seconds expr min lq mean median uq max neval cld dp 12.149932 12.175665 12.848843 12.495599 12.719861 15.589190 10 b dp_sort 2.395636 2.401837 2.706674 2.551375 2.677556 4.373653 10 a
As @alexis_laz pointed out, if there are no duplicates, the impact of sorting is greatly reduced.
s4 <- sample(10^8) s4_sort <- sort(s4) microbenchmark(dp=duplicated(s4), dp_sort=duplicated(s4_sort), times = 10) Unit: seconds expr min lq mean median uq max neval cld dp 8.013995 8.130565 8.593626 8.197501 8.438703 10.639452 10 b dp_sort 6.135788 6.158140 6.751101 6.256739 7.241381 8.913507 10 a