Zack Weinberg
2011-12-21 17:28:42 UTC
I have a data set containing approximately 1.2 million points. They
are TCP payload sizes, so 'x' takes integer values from 1 to 20 and
'y' takes integer values from 0 to approximately 1450; for practical
purposes, the x-axis should be treated as discrete and the y-axis as
continuous.
Earlier in the paper I'm working on, I present a smaller data set with
the same characteristics using stat_sum(), which comes out quite
nicely. However, when applied to the large data set, stat_sum does
not adequately reduce overplotting. All possible y-values are
observed enough times that we just see a vertical smear. stat_bin2d
reveals a bias toward the high end, and by manual choice of breaks I
can make it even better, but it is hard to compare stat_bin2d's
colored rectangles with stat_sum's variable-size dots.
What I would like to do is feed the output of stat_bin2d into
geom_point, essentially mimicking the analysis done by stat_sum but
with binning. The problem, though, is that stat_bin2d consumes the
mappings for 'x' and 'y' and produces mappings for 'xmin', 'xmax',
'ymin', 'ymax', and 'fill', which is what geom_rect wants, but *not*
what geom_point wants. The names of the stat-generated variables for
xmin/xmax/ymin/ymax are not documented, and in any case, resetting 'x'
and 'y' in the geom_point call seems to clobber the mappings needed
for input to the stat call.
Worked example with small, fake data set:
D <- data.frame(i=round(runif(1000, min=1, max=20)),
                l=round(pmin(1450, pmax(0, rnorm(1000,
                                                 mean=1000, sd=600)))))
# want it to look like this, but with binning
ggplot(D, aes(x=i, y=l)) + stat_sum()
# produces rectangles
ggplot(D, aes(x=i, y=l, fill=..density..)) +
  stat_bin2d(breaks=list(x=(1:21 - 0.5), y=seq(0,1500,by=50)))
# this doesn't work:
# "Error: geom_point requires the following missing aesthetics: x, y"
ggplot(D, aes(x=i, y=l, size=..density..)) +
  stat_bin2d(geom='point', breaks=list(x=(1:21 - 0.5), y=seq(0,1500,by=50)))
# this doesn't work either:
# "Error in get(x, envir = this, inherits = inh) : object 'parameters' not found"
ggplot(D, aes(x=i, y=l, size=..density..)) +
  stat_bin2d(geom=geom_point(aes(x=(xmax+xmin)/2, y=(ymax+ymin)/2)),
             breaks=list(x=(1:21 - 0.5), y=seq(0,1500,by=50)))
# nor does this (same error message):
ggplot(D, aes(x=i, y=l, size=..density..)) +
  geom_point(aes(x=(xmax+xmin)/2, y=(ymax+ymin)/2),
             stat=stat_bin2d(breaks=list(x=(1:21 - 0.5), y=seq(0,1500,by=50))))
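For comparison, one way to sidestep the stat entirely would be to do the
binning by hand in base R and give geom_point a data frame of bin centres.
This is only a rough sketch, not a way of making stat_bin2d itself cooperate:
bin_counts() is a made-up helper name (not part of ggplot2), 'prop' is simply
each bin's share of all points, and the breaks are the same ones used above.

```r
# Sketch: bin outside ggplot2, then plot the bin centres with geom_point.
# bin_counts() is a hypothetical helper, not a ggplot2 function.
set.seed(1)
D <- data.frame(i = round(runif(1000, min = 1, max = 20)),
                l = round(pmin(1450, pmax(0, rnorm(1000, mean = 1000, sd = 600)))))

bin_counts <- function(x, y, xbreaks, ybreaks) {
  # Assign each observation to a bin; include.lowest keeps y == 0 in the first bin.
  xb <- cut(x, xbreaks, include.lowest = TRUE)
  yb <- cut(y, ybreaks, include.lowest = TRUE)
  counts <- as.data.frame(table(xbin = xb, ybin = yb), responseName = "count")
  # Bin centres are the midpoints of adjacent breaks.
  xmid <- (head(xbreaks, -1) + tail(xbreaks, -1)) / 2
  ymid <- (head(ybreaks, -1) + tail(ybreaks, -1)) / 2
  counts$x <- xmid[as.integer(counts$xbin)]
  counts$y <- ymid[as.integer(counts$ybin)]
  # Share of all points falling in each bin.
  counts$prop <- counts$count / sum(counts$count)
  # Drop empty bins so they are not drawn at all.
  counts[counts$count > 0, c("x", "y", "count", "prop")]
}

B <- bin_counts(D$i, D$l, xbreaks = 1:21 - 0.5, ybreaks = seq(0, 1500, by = 50))

# Plot only if ggplot2 is installed; dot size encodes the per-bin share.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(B, aes(x = x, y = y, size = prop)) + geom_point()
  print(p)
}
```

This gives variable-size dots on a binned grid, but it duplicates work that
stat_bin2d already does, so a way to route the stat's output into geom_point
directly would still be preferable.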
zw