Monday, June 23, 2014

Steve Miller on Data Distributions

A nice blog post on distributions using R. 
....the first priority with a new data set revolves on determining the distribution of values for each of the attributes. Initially, we wish to see frequencies for the responses of each variable. Those give us a general sense of the data, its distribution and its quality. For categorical attributes, we prefer to visualize frequencies sorted from most to least in an unadorned graphic; for numeric attributes that assume many different values, we like histograms – and perhaps even the more sophisticated kernel density plots – to detail the shape of the data.
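Miller's post works in R; as a rough, language-neutral sketch of the same first pass, here is a Python version using only the standard library. The sample data and the helper names `sorted_frequencies` and `histogram_counts` are made up for illustration and do not come from his post. It prints categorical frequencies from most to least common, then a crude equal-width histogram for a numeric attribute.

```python
from collections import Counter


def sorted_frequencies(values):
    """Return (value, count) pairs ordered from most to least frequent."""
    return Counter(values).most_common()


def histogram_counts(values, bins=5):
    """Count how many numeric values fall into each of `bins` equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 when all values are equal
    counts = [0] * bins
    for v in values:
        # Clamp the index so the maximum value lands in the last bin.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return counts


# Made-up sample data, for illustration only.
colors = ["red", "blue", "red", "green", "red", "blue"]
ages = [21, 22, 22, 25, 31, 34, 35, 35, 36, 58]

# Unadorned frequency chart for the categorical attribute.
for value, count in sorted_frequencies(colors):
    print(f"{value:<6} {'#' * count}")

# Text histogram for the numeric attribute.
for i, count in enumerate(histogram_counts(ages, bins=4)):
    print(f"bin {i}: {'#' * count}")
```

Even this bare text output shows the empty bin and the lone value in the top bin, which a single summary number would hide.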
Ought to be some kind of rule that nothing can start until histograms are done. Count the number of times you see an average referenced during the day with no sense of its dispersion. Seems like data analyst malpractice, 'cept we're never the ones speaking, and only find ourselves quoted.
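To make the point about averages concrete, here is a small sketch in Python (standard library only). The two samples, `steady` and `erratic`, are invented for illustration: they share the same mean, but their standard deviations and ranges tell very different stories.

```python
from statistics import mean, stdev

# Two made-up samples with the same average but very different spread.
steady = [48, 49, 50, 51, 52]
erratic = [10, 30, 50, 70, 90]

for name, sample in [("steady", steady), ("erratic", erratic)]:
    # Report the average alongside two simple measures of dispersion.
    print(
        f"{name:<8} mean={mean(sample):.1f} "
        f"sd={stdev(sample):.1f} range={min(sample)}-{max(sample)}"
    )
```

Quoting only the mean would make these two samples look identical.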

