Archive for category Basic stats & math

Fisher-Yates shuffle for music streaming is perfectly random—too much so for some

The headline “When random is too random” caught my eye when the April issue of Significance, published by The Royal Statistical Society, circulated past me the other day.  Though it makes no statistical sense, the music-streaming service Spotify abandoned the truly random Fisher-Yates shuffle.  The problem with true randomization is that it naturally produces repeats: the same track turning up two or even three days in a row, and occasionally back-to-back.  Although this happened purely by chance, Spotify customers complained.
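For reference, the algorithm in question is short enough to show in full.  Here is a minimal sketch in Python (essentially what the standard library’s random.shuffle implements):

```python
import random

def fisher_yates_shuffle(items):
    """Shuffle a list in place so that every permutation is equally likely."""
    for i in range(len(items) - 1, 0, -1):
        j = random.randint(0, i)               # uniform pick from the unshuffled prefix
        items[i], items[j] = items[j], items[i]

playlist = ["track A", "track B", "track C", "track D"]
fisher_yates_shuffle(playlist)
print(playlist)   # e.g. ['track C', 'track A', 'track D', 'track B']
```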

Along similar lines, I have been aggravated by screen savers that randomly show family photos.  It really seems that some get repeated too often, even though it’s only by chance.  For details on how Spotify’s software engineer Lukáš Poláček tweaked the Fisher-Yates shuffle to stretch songs out more evenly, see this blog post.

“I think Fisher-Yates shuffle is one of the most beautiful random algorithms and it’s amazing that such a complicated problem can be solved in 3 lines of code in some programming languages.  And this is accomplished using the optimal number of operations and optimal amount of randomness.”

– Lukáš Poláček (who nevertheless, owing to the fickleness of music listeners, tweaked the algorithm to introduce a degree of unrandomness and thereby reduce natural clustering)
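Poláček’s post describes spreading each artist’s songs at roughly even intervals through the playlist, with random offsets and jitter, so that clusters become rare.  Here is a hedged sketch of that idea in Python (the function name and jitter parameters are my own simplification, not Spotify’s actual code):

```python
import random

def spread_shuffle(songs_by_artist):
    """Place each artist's songs at roughly even intervals (plus random
    jitter), then sort by position.  A simplification of the dithering-like
    scheme described in Polacek's post, not Spotify's actual code."""
    total = sum(len(songs) for songs in songs_by_artist.values())
    placed = []
    for artist, songs in songs_by_artist.items():
        random.shuffle(songs)                  # random order within the artist
        spacing = total / len(songs)           # ideal gap between this artist's songs
        offset = random.uniform(0, spacing)    # random starting point
        for k, song in enumerate(songs):
            jitter = random.uniform(-0.1, 0.1) * spacing
            placed.append((offset + k * spacing + jitter, artist, song))
    return [(artist, song) for _, artist, song in sorted(placed)]

print(spread_shuffle({"X": ["x1", "x2", "x3"], "Y": ["y1", "y2"], "Z": ["z1"]}))
```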


“naked statistics” not very revealing

One of my daughters gave me a very readable book by economist Charles Wheelan titled “naked statistics: Stripping the Dread from the Data”.  She knew this would be too simple for me, but figured I might pick up some ways to explain statistics better, which I really appreciate.  However, although I very much liked the way Wheelan keeps things simple and makes it fun, his book never did deliver any nuggets that could be mined for my teachings.  Nevertheless, I do recommend “naked statistics” for anyone who is challenged by this subject.  It helps that the author is not a statistician. ; )

By the way, there is very little said in this book about experiment design.  Wheelan mentions in his chapter on “Program Evaluation” the idea of a ‘natural experiment’, that is, a situation where “random circumstances somehow create something approximating a randomized, controlled experiment.”  So far as I am concerned, “natural” (happenstance) data and results from an experiment cannot be mixed, so ‘natural experiment’ is an oxymoron, but I get the point of exploiting an unusually clean contrast ripe for the picking.  I only advise continued skepticism about any results that come from uncontrolled variables.*

*Wheelan cites this study in which the author, economist Adriana Lleras-Muney, made use of a ‘quasi-natural experiment’ (her term) to conclude that “life expectancy of those adults who reached age thirty-five was extended by an extra one and a half years just by their attending one additional year of school” (quote from Wheelan).  Really!?


Educational fun with Galton’s Bean Machine

This blog post on Central limit theorem animation by Nathan Yau brought back fond memories of a quincunx (better known as a bean machine) that I built to show operators how results can vary simply by chance.  It consisted of push-pins laid out in the form of Pascal’s triangle on a board overlaid with clear acrylic.  I’d pour several hundred copper-coated BBs through a funnel and they would fall into the bins at the bottom in the form of a nearly normal curve.

Follow the link above to a virtual quincunx that you can experiment with by changing the number of bins.  To see how varying ball diameters affect the results, check out this surprising video posted by David Bulger, Senior Lecturer, Department of Statistics, Macquarie University, Sydney, Australia.
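You can also pour virtual BBs with a few lines of Python (my own sketch, not tied to either link above): each ball bounces left or right at every row of pins, so its final bin is a binomial count, and the bins fill out a near-normal curve.

```python
import random
from collections import Counter

ROWS, BALLS = 12, 500   # rows of pins, number of BBs poured in

# each BB goes right (1) or left (0) at every pin it hits
bins = Counter(sum(random.randint(0, 1) for _ in range(ROWS))
               for _ in range(BALLS))

for b in range(ROWS + 1):   # crude text histogram: one star per two BBs
    print(f"bin {b:2d} ({bins[b]:3d} BBs): " + "*" * (bins[b] // 2))
```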


Random thoughts

The latest issue of Wired magazine provides a great heads-up by Jonathan Keats on random numbers.  Scrambling the order of runs is a key to good design of experiments (DOE) because it counteracts the influence of lurking variables, such as changing ambient conditions.

Designing an experiment is like gambling with the devil: only a random strategy can defeat all his betting systems.

— R.A. Fisher

Along those lines, I watched with interest when weather forecasts put Tampa at the bulls-eye of the projected track for Hurricane Isaac.  My perverse thought was that this might be the best place to be, at least early on when the cone of uncertainty is widest.

In any case, one does best by expecting the unexpected.  That gets me back to the topic of randomization, which turns out to be surprisingly hard to do, considering the natural capriciousness of weather and life in general.  When I first got going on DOE, I pulled numbered slips of paper out of my hard hat.  Then a statistician suggested I go to a phone book and cull the last four digits of numbers from whatever page opened up haphazardly.  Later I graduated to a table of random numbers (an oxymoron?).  Nowadays I let my DOE software lay out the run order.
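For anyone without DOE software at hand, randomizing run order now takes only a line or two.  A minimal sketch, assuming a hypothetical eight-run design labeled 1 through 8:

```python
import random

runs = list(range(1, 9))   # standard order for a hypothetical 8-run design
random.shuffle(runs)       # scramble the order to counteract lurking variables
print("Perform the runs in this order:", runs)
```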

Check out “Conjuring Truly Random Numbers Just Got Easier,” including the background by Keats on pioneering work in this field by British (1927) and American (1947) statisticians.  Now the Australians have leap-frogged (kangarooed?) everyone, evidently, with a method that produces 5.7 billion “truly random” (how do they know?) values per second.  Rad mon!



Statisticians no more—now “data scientists”

I spent a week earlier this month at the Joint Statistical Meetings (JSM)—an annual convocation of “data scientists”, as some of these number crunchers now deem themselves.  But most statisticians remain ‘old school’, as evidenced by this quote:

“Some time during the past couple of years, statistics became data science’s older, more boring sibling that always played by the rules.”

— Nathan Yau*

I tend to agree—being suspicious of changes in titles as a cover for shenanigans.  It seems to me that “data science” provides a smoke screen for taking unwarranted leaps from shaky numbers.  As the shirt sold at JSM by the American Statistical Association (ASA) says, “friends don’t let friends extrapolate.”

*Incorrectly attributed initially (my mistake) to Carnegie Mellon statistics professor Cosma Shalizi, who was credited by Yau for speaking up on this subject.


Where the radix point becomes a comma

Prompted by an ever-growing flow of statistical questions from overseas, Stat-Ease consultant Wayne Adams recently circulated this Wikipedia link that provides a breakdown of countries using a decimal point versus a comma for the radix point—the separator of the integer part of a number from its fractional side.

For more background on decimal styles over time and place, see this Science Editor article by Amelia Williamson.  It credits Scottish mathematician John Napier* with being the first to use a period.  However, it seems that he wavered later by using a comma, thus setting the stage for the comma as an alternative.  Given the use of commas to separate thousands from millions, millions from billions, and so on, numbers can easily be misread by several orders of magnitude if you do not keep a sharp eye on the source.
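The hazard is easy to demonstrate.  This toy parser (my own illustration, not a library routine) shows the same string coming out three orders of magnitude apart under the two conventions:

```python
def parse_point_radix(s):
    """Interpret s in point-as-radix style, e.g. 1,234.5"""
    return float(s.replace(",", ""))

def parse_comma_radix(s):
    """Interpret s in comma-as-radix style, e.g. 1.234,5"""
    return float(s.replace(".", "").replace(",", "."))

s = "1.234"
print(parse_point_radix(s))   # 1.234  (about one)
print(parse_comma_radix(s))   # 1234.0 (over a thousand)
```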

So, all you math & stats boffins—watch it!

*As detailed in this 2009 blog post, I first learned of this fellow from seeing his “bones” (the calculating rods known as Napier’s bones) on display at IBM’s Watson Research Center in New York.


Obscurity does not equal profundity

“GOOD with numbers? Fascinated by data? The sound you hear is opportunity knocking.” This is how Steve Lohr of the New York Times leads off his article in today’s Sunday paper on The Age of Big Data. Certainly the abundance of data has created a big demand for people who can crunch numbers. However, I am not sure the end result will be nearly as profitable as employers may hope.

“Many bits of straw look like needles.”

– Trevor Hastie, Professor of Statistics, Stanford University, co-author of The Elements of Statistical Learning (2nd edition).

I take issue with extremely tortuous paths to complicated models based on happenstance data.  This can be every bit as bad as oversimplifications such as relying on linear trend lines (re Why you should be very leery of forecasts).  As I once heard DOE guru George Box say (in regard to overly complex Taguchi methodologies): “Obscurity does not equal profundity.”

For example, Lohr touts the replacement of earned run average (ERA) with “SIERA”—Skill-Interactive Earned Run Average.  Get all the deadly details here from the inventors of this new pitching-performance metric.  In my opinion, baseball itself is already complicated enough (try explaining it to someone who only follows soccer) without going to such statistical extremes for assessing players.

The movie “Moneyball” being up for Academy Awards is stoking the fever for “big data.”  I am afraid that, after all is said and done, the call may be for “money back.”


Extracting Sunbeams from Cucumbers

With this intriguing title, Richard Feinberg and Howard Wainer draw readers of Volume 20, Number 4 into what might have been a dry discourse: how contributors to The Journal of Computational and Graphical Statistics rely mainly on tables to display data.  Given that “Graphical” is in the title of this publication, it raises the question of whether this method of presenting statistics really works.

When working on the committee that developed the ASTM E1169-07 Standard Practice for Conducting Ruggedness Tests, I introduced the half-normal plot for selecting effects from two-level factorial experiments.  Most of the committee favored this, but one individual – a professor emeritus from a top school of statistics – resisted the introduction of this graphical tool.  He believed that only numerical methods, specifically analysis of variance (ANOVA) tables, could support objective decisions for model selection.  My comeback was to dodge the issue by simply using both graphs and tables – this need not be an either/or choice.  Why not do both, or merge them by putting numbers onto graphs – the best of both worlds?
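For the curious, here is a sketch of the half-normal plot itself, with made-up effect values (the plotting positions follow one common convention; software packages vary).  Active effects peel away from the straight line formed by the noise effects:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import halfnorm

# made-up effects from a hypothetical unreplicated 2^3 factorial
effects = {"A": 10.2, "B": 6.1, "C": 0.4, "AB": 4.9,
           "AC": -0.6, "BC": 0.2, "ABC": 0.3}

labels, values = zip(*sorted(effects.items(), key=lambda kv: abs(kv[1])))
abs_effects = np.abs(values)
n = len(abs_effects)

# half-normal quantiles for the ordered |effects| (one common convention)
quantiles = halfnorm.ppf((np.arange(1, n + 1) - 0.5) / n)

plt.scatter(quantiles, abs_effects)
for q, e, lab in zip(quantiles, abs_effects, labels):
    plt.annotate(lab, (q, e))
plt.xlabel("half-normal quantile")
plt.ylabel("|effect|")
plt.title("Half-normal plot: A, B and AB stand out from the noise line")
plt.show()
```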

“A heavy bank of figures is grievously wearisome to the eye, and the popular mind is as incapable of drawing any useful lessons from it as of extracting sunbeams from cucumbers.”

— Economists (brothers) Farquhar and Farquhar (1891)

In their article, which can be seen here, Feinberg and Wainer take a different tack (path of least resistance?): make tables look more like graphs.  Here are some of their suggestions for doing so (a small code sketch applying a few of them follows the list):

  • Round data to three digits or fewer.
  • Line up comparable numbers by column, not row.
  • Provide summary statistics, in particular medians.
  • Don’t default to alphabetical or some other arbitrary order: Stratify by size or some other meaningful attribute.
  • Call out data that demands attention by making it bold and/or bigger and/or boxing it.
  • Insert extra space between rows or columns of data where they change greatly (gap).
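A few of these suggestions are mechanical enough to script.  Here is a small pandas sketch with made-up data (the countries and figures are invented for illustration):

```python
import pandas as pd

# invented arms-exports figures, purely for illustration
df = pd.DataFrame({"country": ["Aland", "Borduria", "Cascadia", "Dorne"],
                   "exports": [123.4567, 45.678, 1234.5678, 3.4567]})

df = df.sort_values("exports", ascending=False)  # order by size, not alphabet
df["exports"] = df["exports"].round(1)           # drop insignificant digits
print(df.to_string(index=False))
print("median exports:", df["exports"].median()) # the summary statistic
```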

Check out the remodeled table on arms transfers, which makes it clear that, unlike the uptight USA, the laissez-faire French will sell to anyone.  It would be hard to dig that nugget out of the original data compilation.


Trouble with math & stats? Blame it on dyscalculia.

According to this article in the Journal of Child Neurology, “dyscalculia is a specific learning disability affecting the normal acquisition of arithmetic skills, which may stem from a brain-based disorder.”  Are people born with this inability to do math in particular, but otherwise mentally capable – for example, in reading and writing?  Up until now it’s been difficult to measure.  For example, my wife, who has taught preschool for several decades, observes that some of her children progress much more slowly than others.  However, she sees no differential in math versus reading – these attributes seem to be completely correlated.  The true picture may finally emerge now that Michèle M. M. Mazzocco et al. have published this paper on how Preschoolers’ Precision of the Approximate Number System Predicts Later School Mathematics Performance.

Certainly many great minds, particularly authors, abhor math and stats, even though they may not suffer from dyscalculia (only numerophobia).  The renowned essayist Hilaire Belloc said*

Statistics are the triumph of the quantitative method, and the quantitative method is the victory of sterility and death.

I wonder how he liked balancing his checkbook.

Meanwhile, public figures such as television newscasters and politicians, who appear to be intelligent otherwise (debatable!), say the silliest things when it comes to math and stats.  For example, a U.S. governor, speaking on his state’s pension fund, said that “when they were set up, life expectancy was only 58, so hardly anyone lived long enough to get any money.”**  One finds this figure of 58, the life expectancy of men in 1930, around the time Social Security began, cited often by pundits discussing the problems of retirement funds.  Of course, this was the life expectancy at birth, in times when infant mortality remained at much higher levels than today.  According to this fact sheet by the Social Security Administration (SSA), 6.7 million Americans were aged 65 or older in 1930, a number that has increased alarmingly since.  The SSA also provides interesting statistics on Average Remaining Life Expectancy for Those Surviving to Age 65, which show surprisingly slow gains over the decades.  I leave it to those of you who are neither numerophobic nor sufferers of dyscalculia to reconcile these seemingly contradictory statistical tables.
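The reconciliation is simple arithmetic, as a toy cohort shows (numbers invented for illustration, not SSA data): heavy infant mortality drags the at-birth average down near 58 without implying that adults died young.

```python
# toy cohort of 100 births: invented numbers, not SSA data
# 25 die in infancy (age 0); 75 survive and die at age 78
ages_at_death = [0] * 25 + [78] * 75

at_birth = sum(ages_at_death) / len(ages_at_death)
to_65 = [a for a in ages_at_death if a >= 65]
remaining_at_65 = sum(to_65) / len(to_65) - 65

print(f"life expectancy at birth: {at_birth:.1f} years")   # 58.5
print(f"remaining years at 65:    {remaining_at_65:.1f}")  # 13.0
```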

*From “On Statistics”, The Silence of the Sea, Glendalough Press, 2008 (originally published 1941).

**From “Real world Economics / Errors in economics coverage spread misunderstandings” by Edward Lotterman.


Clickers allow students to vote on which answer is right for math questions

Yesterday I attended a fun webinar on Interactive Statistics Education by Dale Berger of Claremont Graduate University.  Because I was multitasking (aka “continuous partial attention” — ha ha) at work while attending this webinar, my report provides just the highlights.  However, you can figure out for yourself what they (the stats department at Claremont) have to offer by going to this web page offering WISE (Web Interface for Statistics Education) tutorials and applets.*

After the presentation, a number of educators brainstormed on interactive stats.  David Lane of Rice U (author of many stat applets) suggested the use of “interactive clickers” – see this short (< 2 min.) newscast, for example.  I wonder what happens when a majority votes for the wrong answer?  For some teachers it might be easiest just to declare the most popular response the correct answer.  That would be consistent with the way things seem to be going in politics nowadays. ; )

*Just for fun, try the Investigating the Central Limit Theorem (CLT) applet (click the link from the page referenced above, or simply click here).  This would be a good applet to provide when illustrating the CLT using dice (such as is done in this in-class exercise developed by two professors from De Anza College).  In this case, pick the uniform Population and sample size 2.  Then Draw a Sample repeatedly, and, finally, just Draw 100 samples.  Repeat the exercise with sample size 5, a la the game of Yahtzee (a favorite in my youth).  Notice how, as n goes up, the distribution of averages becomes more normal and narrower.  That’s the power of averaging.
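If you would rather script the demonstration than click through the applet, a short numpy sketch (using the dice setup described above) shows the narrowing directly:

```python
import numpy as np

rng = np.random.default_rng()

for n in (2, 5):   # dice per sample, as in the applet exercise
    means = rng.integers(1, 7, size=(10_000, n)).mean(axis=1)
    print(f"n={n}: mean of averages = {means.mean():.2f}, "
          f"std dev = {means.std():.2f}")   # shrinks roughly as 1.71/sqrt(n)
```

A histogram of `means` (via matplotlib, say) shows the other half of the story: the distribution grows more bell-shaped as n increases.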
