stat_bin2d with geom_point (or, stat_sum with binning)

Discussion:

stat_bin2d with geom_point (or, stat_sum with binning)

Zack Weinberg

2011-12-21 17:28:42 UTC

I have a data set containing approximately 1.2 million points. They
are TCP payload sizes, so 'x' takes integer values from 1 to 20 and
'y' takes integer values from 0 to approximately 1450; for practical
purposes, the x-axis should be treated as discrete and the y-axis as
continuous.

Earlier in the paper I'm working on, I present a smaller data set with
the same characteristics using stat_sum(), which comes out quite
nicely. However, when applied to the large data set, stat_sum does
not adequately reduce overplotting. All possible y-values are
observed enough times that we just see a vertical smear. stat_bin2d
reveals a bias toward the high end, and by manual choice of breaks I
can make it even better, but it is hard to compare stat_bin2d's
colored rectangles with stat_sum's variable-size dots.

What I would like to do is feed the output of stat_bin2d into
geom_point, essentially mimicking the analysis done by stat_sum but
with binning. The problem, though, is that stat_bin2d consumes the
mappings for 'x' and 'y' and produces mappings for 'xmin', 'xmax',
'ymin', 'ymax', and 'fill', which is what geom_rect wants, but *not*
what geom_point wants. The names of the stat-generated variables for
xmin/xmax/ymin/ymax are not documented, and in any case, resetting 'x'
and 'y' in the geom_point call seems to clobber the mappings needed
for input to the stat call.

Worked example with small, fake data set:

D <- data.frame(i=round(runif(1000, min=1, max=20)),
                l=round(pmin(1450, pmax(0, rnorm(1000,
                                                 mean=1000, sd=600)))))

# want it to look like this, but with binning
ggplot(D, aes(x=i, y=l)) + stat_sum()

# produces rectangles
ggplot(D, aes(x=i, y=l, fill=..density..)) +
stat_bin2d(breaks=list(x=(1:21 - 0.5), y=seq(0,1500,by=50)))

# this doesn't work:
# "Error: geom_point requires the following missing aesthetics: x, y"
ggplot(D, aes(x=i, y=l, size=..density..)) +
stat_bin2d(geom='point', breaks=list(x=(1:21 - 0.5), y=seq(0,1500,by=50)))

# this doesn't work either:
# "Error in get(x, envir = this, inherits = inh) : object 'parameters'
# not found"
ggplot(D, aes(x=i, y=l, size=..density..)) +
stat_bin2d(geom=geom_point(aes(x=(xmax+xmin)/2, y=(ymax+ymin)/2)),
breaks=list(x=(1:21 - 0.5), y=seq(0,1500,by=50)))

# nor does this (same error message):
ggplot(D, aes(x=i, y=l, size=..density..)) +
geom_point(aes(x=(xmax+xmin)/2, y=(ymax+ymin)/2),
stat=stat_bin2d(breaks=list(x=(1:21 - 0.5), y=seq(0,1500,by=50))))

zw

Chris Neff

2011-12-21 17:33:27 UTC

I don't understand why you need points. Points are misleading in this
example because it isn't obvious that you are binning when displaying
a point.

In essence you have 20 classes, x, and you want to show something
for each class. Why do you need to bin at all? Something like a box
plot or violin plot would more obviously show what is going on.

--
You received this message because you are subscribed to the ggplot2 mailing list.
Please provide a reproducible example: http://gist.github.com/270442

To post: email ggplot2-/***@public.gmane.org
To unsubscribe: email ggplot2+unsubscribe-/***@public.gmane.org
More options: http://groups.google.com/group/ggplot2

Chris Neff

2011-12-21 17:48:00 UTC

I didn't understand that there were orderings here. I don't understand
why you can't have bin2d do what you want. The issue with using
points is someone looking at it has no clue where the break is for a
given bin, or that it is even necessarily binned. Also, people usually
have an easier time with color differences than with size ones. For
instance, I think

ggplot(D, aes(x=i, y=l)) + geom_bin2d()

looks far easier to see what is going on than

ggplot(D, aes(x=i, y=l)) + stat_sum()

does. What is it about the binned plot that you don't like? If it is
the gradient of colors, you may want to transform the color ramping
somehow, like log transform or something.

Post by Chris Neff
I don't understand why you need points. Points are misleading in this
example because it isn't obvious that you are binning when displaying
a point.
In essence you have 20 classes, x, and you want to show something
for each class. Why do you need to bin at all? Something like a box
plot or violin plot would more obviously show what is going on.

I tried both of those. They obscure the horizontal correlations far
too much; in context, vertical precision is not terribly important,
but being able to eyeball the x-axis evolution of the pattern is
critical.
zw

Zack Weinberg

2011-12-21 17:58:04 UTC

The discreteness of the x-axis is lost; the background grid is
obscured, making it harder to pick out specific y-values that are of
interest (in the real data set I manually force those to get their own
bins); the time evolution of the pattern is also much harder to see
(this is maybe not obvious with the fake data); it is inconsistent
with another figure that was done the other way; I'm not supposed to
use color (journal requirement).

Can I please just have an answer to the technical question I asked,
rather than a graphic design argument?

zw

Chris Neff

2011-12-21 18:05:52 UTC

You ask for free advice and that's what you get. Too many people
don't think enough about why they are making a plot, only how.

Make a binning of the y values yourself using cut() then use those to
plot with.
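
A minimal sketch of that suggestion, using the fake data set from the original post. The 50-unit bin width and the column names ybin, ymid and n are illustrative choices, not something from the thread. The binning uses only base R; the plot is drawn only if ggplot2 is installed.

```r
# Sketch only: bin width and the names ybin, ymid, n are assumptions.
set.seed(1)
D <- data.frame(i = round(runif(1000, min = 1, max = 20)),
                l = round(pmin(1450, pmax(0, rnorm(1000, mean = 1000, sd = 600)))))

# Bin y into 50-unit intervals; x is already discrete, so it is left alone.
breaks <- seq(0, 1500, by = 50)
D$ybin <- cut(D$l, breaks = breaks, include.lowest = TRUE, right = FALSE)

# Count observations per (x, y-bin) cell and drop the empty cells.
binned <- as.data.frame(table(i = D$i, ybin = D$ybin), responseName = "n")
binned <- binned[binned$n > 0, ]

# table() turns i into a factor; convert it back to numbers, and place
# each point at the midpoint of its y bin.
binned$i <- as.numeric(as.character(binned$i))
mids <- head(breaks, -1) + diff(breaks) / 2
binned$ymid <- mids[as.integer(binned$ybin)]

# One dot per occupied cell, sized by count (similar in spirit to stat_sum).
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(binned, aes(x = i, y = ymid, size = n)) + geom_point()
  print(p)
}
```

Because the breaks are passed to cut() explicitly, y-values of special interest can be given their own bins by adding break points around them.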


Hadley Wickham

2011-12-21 18:04:48 UTC

Post by Chris Neff
given bin, or that it is even necessarily binned. Also, people usually
have an easier time with color differences than with size ones. For
instance, I think

I disagree on this point - I'm pretty sure it's easier to perceive
differences in size than in colour. There hasn't been
visualisation research (that I know of) that shows this directly, but
I think it's pretty clear from what we know about the psychology of
perception.

Hadley
--
Assistant Professor / Dobelman Family Junior Chair
Department of Statistics / Rice University
http://had.co.nz/

Chris Neff

2011-12-21 18:10:38 UTC

I thought I had read what I said somewhere, but I can't find any
support now, so I'll retract that and try to research it further. I know
from personal experience that yes, it is easier to distinguish
extremes in size, but in the middling values it is harder. When I'm
looking at a plot with a size gradient and the legend shows 8 values
of it, I can tell value 1 from value 8, but it is almost impossible
for me to tell if something is value 2 or 3. And on the really
small end it can be tough to tell the difference between "really
small" and "doesn't actually exist". Color still has the
middle-distinction problem, true, but at least for summarization it has a
floor.


Winston Chang

2011-12-21 18:27:55 UTC

Cleveland (1984) did some research into how well we could interpret various
visual properties. Color is at the bottom of the list. There's a nice
diagram of it here:
http://processtrends.com/TOC_data_visualization.htm

Zack, I'd suggest binning the data yourself and then graphing the binned
data.

-Winston


Chris Neff

2011-12-21 18:29:47 UTC

Excellent resource. Thanks Winston I shall digest this properly :)


Hadley Wickham

2011-12-21 19:49:05 UTC

Except if you read the paper closely, he only actually tested a subset
of the visual properties.

Hadley
