# Archive for category Basic stats & math

### Errors, blunders & lies

Posted by mark in Basic stats & math, pop on June 27, 2017

David S. Salsburg, author of “The Lady Tasting Tea”*, which I enjoyed greatly, hits the spot again with his new book, “Errors, Blunders & Lies: How to Tell the Difference”. It’s all about a fundamental statistical equation: Observation = model + error. The errors, of course, are normal and must be expected. But blunders and lies cannot be tolerated.

The section on errors concludes with my favorite chapter: “Regression and Big Data”. There Salsburg endorses my favorite way to avoid over-fitting of happenstance results—hold back at random 10 percent of the data and see how well these outcomes are predicted by the 90 percent you regress.** Whenever I tried this on manufacturing data it became very clear that our high-powered statistical models worked very well for predicting what happened last month. 😉 They were worthless for seeing into the future.
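For anyone who wants to try the hold-back check, here is a minimal sketch in plain Python. It assumes a simple straight-line model with one predictor, and the data are simulated purely for illustration; the function names (`fit_line`, `rmse`, `holdout_check`) are my own, not anything from Salsburg’s book.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def rmse(xs, ys, intercept, slope):
    """Root-mean-square error of the line's predictions."""
    squared = [(y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys)]
    return (sum(squared) / len(squared)) ** 0.5

def holdout_check(xs, ys, holdout_fraction=0.10, seed=None):
    """Fit on ~90% of the rows, then score on the ~10% held back at random."""
    rng = random.Random(seed)
    indices = list(range(len(xs)))
    rng.shuffle(indices)
    n_hold = max(1, round(holdout_fraction * len(xs)))
    hold, train = indices[:n_hold], indices[n_hold:]
    train_x = [xs[i] for i in train]
    train_y = [ys[i] for i in train]
    hold_x = [xs[i] for i in hold]
    hold_y = [ys[i] for i in hold]
    intercept, slope = fit_line(train_x, train_y)
    return {
        "n_train": len(train),
        "n_holdout": len(hold),
        "train_rmse": rmse(train_x, train_y, intercept, slope),
        "holdout_rmse": rmse(hold_x, hold_y, intercept, slope),
    }

# Simulated data (an assumption for the demo): a true line plus random noise.
noise = random.Random(2017)
xs = [float(i) for i in range(100)]
ys = [2.0 + 0.5 * x + noise.gauss(0, 3) for x in xs]
print(holdout_check(xs, ys, seed=1))
```

If the hold-out error comes in far above the training error, that is the warning sign of over-fitting. Note that a random hold-back only tests prediction within the same period; it would not by itself catch the “worked last month, worthless next month” problem, which calls for holding back the most recent data instead.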

Another personal favorite is the bit on spurious correlations that Italian statistician Carlo Bonferroni*** guarded against, also known as the “will of the wisps” per the founder of Yale’s statistics department, Francis Anscombe.

If you are looking for statistical insights that come without all the dreary mathematical details, this book on “Errors, Blunders & Lies” will be just the ticket. Salsburg concludes with a timely heads-up on the statistical lies caused by “curbstoning” (reported here by the *New York Post*), which may soon combine with gerrymandering (see my previous post) to create a perfect storm of data tampering in the upcoming census. We’d all do well to sharpen up our savvy on stats!

The old saying is that “figures will not lie,” but a new saying is “liars will figure.” It is our duty, as practical statisticians, to prevent the liar from figuring; in other words, to prevent him from perverting the truth, in the interest of some theory he wishes to establish.

– Carroll D. Wright, U.S. government statistician, speaking to the 1889 Convention of Commissioners of Bureaus of Statistics of Labor.

*Based on the story told here.

**An idea attributed to the inventor of modern-day statistics, R. A. Fisher, and endorsed by famed mathematician John Tukey, who suggested the hold-back be 10 percent.

***See my blog on Bonferroni of Bergamo.

### Models responsible for whacky weather

Posted by mark in Basic stats & math, pop, science on August 14, 2016

Watching Brazilian supermodel Gisele Bundchen sashay across the Olympic stadium in Rio reminded me that, while these fashion plates are really dishy to view, they can be very dippy when it comes to forecasting. Every time one of our local weather gurus says that their models are disagreeing, I wonder why they would ask someone like Gisele. What do she and her like know about meteorology?

There really is a connection between fashion and statistical models: the random walk. However, this movement would be more like that of a drunken man than a fashionably-calculated stroll down the catwalk. For example, see this video by an MIT professor showing 7 willy-nilly paths from a single point.

Anyways, I am wandering all over the place with this blog. Mainly I wanted to draw your attention to the Monte Carlo method for forecasting. I used this for my MBA thesis in 1980, burning up many minutes of very expensive main-frame computer time in the late ’70s. What got me going on this whole Monte Carlo meander is this article from yesterday’s *Wall Street Journal*. Check out how the European models did better than the American ones on predicting the path of Hurricane Sandy. Evidently the Euros are on to something as detailed in this *Scientific American* report at the end of last year’s hurricane season.
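To give a feel for the random walk and the Monte Carlo idea, here is a toy sketch in Python. It assumes a drunkard who takes unit steps north, south, east or west with equal odds, which is my simplification and has nothing to do with how real weather models work. The function names are made up for this illustration.

```python
import random

STEPS = [(1, 0), (-1, 0), (0, 1), (0, -1)]  # east, west, north, south

def random_walk(n_steps, rng):
    """Return every (x, y) position visited, starting from the origin."""
    x, y = 0, 0
    path = [(x, y)]
    for _ in range(n_steps):
        dx, dy = rng.choice(STEPS)
        x, y = x + dx, y + dy
        path.append((x, y))
    return path

def monte_carlo_endpoints(n_walks, n_steps, seed=None):
    """Run many independent walks and collect where each one ends up."""
    rng = random.Random(seed)
    return [random_walk(n_steps, rng)[-1] for _ in range(n_walks)]

# Seven willy-nilly paths from a single point, 1000 steps each.
endpoints = monte_carlo_endpoints(n_walks=7, n_steps=1000, seed=2016)
for i, (x, y) in enumerate(endpoints, start=1):
    print(f"walk {i}: ended at ({x}, {y})")
```

Run thousands of walks instead of seven and the scatter of endpoints becomes a picture of the uncertainty, which is the basic trick behind Monte Carlo forecasting.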

I have a random thought for improving the American models—ask Cindy Crawford. She graduated as valedictorian of her high school in Illinois and earned a scholarship for chemical engineering at Northwestern University. Cindy has all the talents to create a convergence of fashion and statistical models. That would be really sweet.

### Big data puts an end to the reign of statistics

Posted by mark in Basic stats & math on March 3, 2016

Michael S. Malone of the *Wall Street Journal* proclaimed last month* that

One of the most extraordinary features of big data is that it signals the end of the reign of statistics. For 400 years, we’ve been forced to sample complex systems and extrapolate. Now, with big data, it is possible to measure everything…

Based on what I’ve gathered (admittedly only a small and probably unrepresentative sample), I think this is very unlikely. Nonetheless, if I were a statistician, I would reposition myself as a “Big Data Scientist”.

*“The Big-Data Future Has Arrived”, 2/22/16.

### Fisher-Yates shuffle for music streaming is perfectly random—too much so for some

Posted by mark in Basic stats & math, Consumer behavior on July 28, 2015

The headline “When random is too random” caught my eye when the April issue of *Significance*, published by The Royal Statistical Society, crossed my desk the other day. It really makes no statistical sense, but the music-streaming service Spotify abandoned the truly random Fisher-Yates shuffle. The problem with randomization is that it naturally produces repeats of tracks two or even three days in a row and occasionally back-to-back. Although this happened purely by chance, Spotify consumers complained.

Along similar lines, I have been aggravated by screen savers that randomly show family photos. It really seems that some get repeated too often even though it’s only by chance. For details on how Spotify’s software engineer Lukáš Poláček tweaked the Fisher-Yates shuffle to stretch songs out more evenly, see this blog post.

“I think Fisher-Yates shuffle is one of the most beautiful random algorithms and it’s amazing that such a complicated problem can be solved in 3 lines of code in some programming languages. And this is accomplished using the optimal number of operations and optimal amount of randomness.”

– Lukáš Poláček (who nevertheless, due to fickleness of music listeners, tweaked the algorithm to introduce a degree of unrandomization so it would reduce natural clustering)
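For the curious, here is a minimal Python sketch of the standard (unmodified) Fisher-Yates shuffle. It is the textbook version only; Spotify’s tweak to spread songs out is not shown, and the playlist is invented for the demo.

```python
import random

def fisher_yates_shuffle(items, rng=random):
    """Return a shuffled copy in which every ordering is equally likely."""
    shuffled = list(items)
    # Walk from the last position down to the second, swapping each slot
    # with a randomly chosen slot at or before it.
    for i in range(len(shuffled) - 1, 0, -1):
        j = rng.randint(0, i)  # randint includes both endpoints
        shuffled[i], shuffled[j] = shuffled[j], shuffled[i]
    return shuffled

playlist = ["track A", "track B", "track C", "track D", "track E"]
print(fisher_yates_shuffle(playlist))
```

The loop really is about three lines, just as Poláček says, and it makes exactly one pass through the list.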

### “naked statistics” not very revealing

Posted by mark in Basic stats & math, pop on July 18, 2013

One of my daughters gave me a very readable book by economist Charles Wheelan titled “naked statistics, Stripping the Dread from the Data”. She knew this would be too simple for me, but figured I might pick up some ways to explain statistics better, which I really appreciate. However, although I very much liked the way Wheelan keeps things simple and makes it fun, his book never did deliver any nuggets that could be mined for my teachings. Nevertheless, I do recommend “naked statistics” for anyone who is challenged by this subject. It helps that the author is not a statistician. ; )

By the way, there is very little said in this book about experiment design. Wheelan mentions in his chapter on “Program Evaluation” the idea of a ‘natural experiment’, that is, a situation where “random circumstances somehow create something approximating a randomized, controlled experiment.” So far as I am concerned “natural” data (happenstance) and results from an experiment cannot be mixed, thus natural experiment is an oxymoron, but I get the point of exploiting an unusually clean contrast ripe for the picking. I only advise continued skepticism on any results that come from uncontrolled variables.*

*Wheelan cites this study in which the author, economist Adriana Lleras-Muney, made use of a ‘quasi-natural experiment’ (her term) to conclude that “life expectancy of those adults who reached age thirty-five was extended by an extra one and a half years just by their attending one additional year of school” (quote from Wheelan). Really!?

### Educational fun with Galton’s Bean Machine

Posted by mark in Basic stats & math on June 3, 2013

This blog on Central limit theorem animation by Nathan Yau brought back fond memories of a quincunx (better known as a bean machine) that I built to show operators how results can vary simply by chance. It consisted of push-pins laid out in the form of Pascal’s triangle onto a board overlaid with clear acrylic. I’d pour in several hundred copper-coated BBs through a funnel and they would fall into the bins at the bottom in the form of a nearly normal curve.

Follow the link above to a virtual quincunx that you can experiment on by changing the number of bins. To see how varying ball diameters affect the results, check out this surprising video posted by David Bulger, Senior Lecturer, Department of Statistics, Macquarie University, Sydney, Australia.
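If you have no push-pins handy, a bean machine can be roughed out in a few lines of Python. This sketch assumes an idealized board where every ball bounces left or right with 50/50 odds at each row of pins, independently of the other balls, which real BBs do not quite manage. The function name and settings are mine.

```python
import random
from collections import Counter

def galton_board(n_balls, n_rows, rng):
    """Drop balls through n_rows of pins; the bin is the count of rightward bounces."""
    bins = Counter()
    for _ in range(n_balls):
        bin_index = sum(rng.random() < 0.5 for _ in range(n_rows))
        bins[bin_index] += 1
    return [bins[k] for k in range(n_rows + 1)]

rng = random.Random(2013)
counts = galton_board(n_balls=500, n_rows=10, rng=rng)
for k, count in enumerate(counts):
    print(f"bin {k:2d} | {'#' * (count // 5)}")
```

The bin counts follow a binomial distribution, which is why the pile looks nearly normal once there are enough rows of pins.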

### Random thoughts

Posted by mark in Basic stats & math, design of experiments, Uncategorized on August 26, 2012

The latest issue of *Wired* magazine provides a great heads-up on random numbers by Jonathan Keats. Scrambling the order of runs is a key to good design of experiments (DOE): this counteracts the influence of lurking variables, such as changing ambient conditions.

Designing an experiment is like gambling with the devil: only a random strategy can defeat all his betting systems.

— R.A. Fisher

Along those lines, I watched with interest when weather forecasts put Tampa at the bulls-eye of the projected track for Hurricane Isaac. My perverse thought was this might be the best place to be, at least early on when the cone of uncertainty is widest.

In any case, one does best by expecting the unexpected. That gets me back to the topic of randomization, which turns out to be surprisingly hard to do considering the natural capriciousness of weather and life in general. When I first got going on DOE, I pulled numbered slips of paper out of my hard hat. Then a statistician suggested I go to a phone book and cull numbers from the last 4 digits from whatever page opened up haphazardly. Later I graduated to a table of random numbers (an oxymoron?). Nowadays I let my DOE software lay out the run order.
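For those without DOE software (or a hard hat), here is a minimal Python sketch of scrambling the run order for a two-level full factorial. The factor names and levels are hypothetical, and Python’s built-in generator is pseudo-random rather than “truly random”, which is plenty for laying out an experiment.

```python
import itertools
import random

def randomized_run_order(factors, seed=None):
    """Build every combination of factor levels, then shuffle the run order."""
    names = list(factors)
    runs = [dict(zip(names, levels))
            for levels in itertools.product(*factors.values())]
    rng = random.Random(seed)
    rng.shuffle(runs)
    return runs

# Hypothetical factors, just for illustration.
factors = {
    "temperature": [150, 180],
    "pressure": [1, 2],
    "catalyst": ["A", "B"],
}
for run_number, settings in enumerate(randomized_run_order(factors), start=1):
    print(run_number, settings)
```

Leaving `seed` unset gives a fresh order each time; setting it makes the layout reproducible for the lab notebook.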

Check out how Conjuring Truly Random Numbers Just Got Easier, including the background by Keats on pioneering work in this field by British (1927) and American (1947) statisticians. Now the Australians have leap-frogged (kangarooed?) everyone, evidently, with a method that produces 5.7 billion “truly random” (how do they know?) values per second. Rad mon!

### Statisticians no more—now “data scientists”

Posted by mark in Basic stats & math on August 15, 2012

I spent a week earlier this month at the Joint Statistical Meetings (JSM)—an annual convocation of “data scientists”, as some of these number crunchers now deem themselves. But most statisticians remain ‘old school’ as evidenced by this quote:

“Some time during the past couple of years, statistics became data science’s older, more boring sibling that always played by the rules.”

— Nathan Yau*

I tend to agree, being suspicious of changes in titles as a cover for shenanigans. It seems to me that “data science” provides a smoke screen to take unwarranted leaps from shaky numbers. As the shirt sold at JSM by the American Statistical Association (ASA) says, “friends don’t let friends extrapolate.”

*Incorrectly attributed initially (my mistake) to Carnegie Mellon statistics professor Cosma Shalizi, who was credited by Yau for speaking up on this subject.

### Where the radix point becomes a comma

Posted by mark in Basic stats & math on April 23, 2012

Prompted by an ever-growing flow of statistical questions from overseas, Stat-Ease Consultant Wayne Adams recently circulated this Wikipedia link that provides a breakdown on countries using a decimal point versus a comma for the radix point, the separator of the integer part from the fractional side of a number.

For more background on decimal styles over time and place see this *Science Editor* article by Amelia Williamson. It credits Scottish mathematician John Napier* for being the first to use a period. However, it seems that he wavered later by using a comma, thus setting the stage for this being an alternative. Given the use of commas to separate thousands from millions and millions from billions and so on, numbers can be misinterpreted by several orders of magnitude very easily if you do not keep a sharp eye on the source.
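To see how easily the same characters turn into two very different numbers, here is a small Python sketch. The `parse_number` helper is hypothetical and deliberately simple: it assumes the only separators in play are the period, the comma and an ordinary space.

```python
def parse_number(text, decimal_sep="."):
    """Parse a numeric string given which character marks the radix point."""
    if decimal_sep not in (".", ","):
        raise ValueError("decimal_sep must be '.' or ','")
    group_sep = "," if decimal_sep == "." else "."
    # Strip the thousands separators (and spaces) before converting.
    cleaned = text.replace(group_sep, "").replace(" ", "")
    if decimal_sep == ",":
        cleaned = cleaned.replace(",", ".")
    return float(cleaned)

print(parse_number("1,234", decimal_sep="."))     # 1234.0
print(parse_number("1,234", decimal_sep=","))     # 1.234
print(parse_number("1.234,56", decimal_sep=","))  # 1234.56
```

The first two lines read the identical string and land a factor of 1000 apart, which is exactly the trap to watch for.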

So, all you math & stats boffins—watch it!

*As detailed in this 2009 blog, I first learned of this fellow from seeing his bones on display at IBM’s Watson Research Center in New York.

### Obscurity does not equal profundity

Posted by mark in Basic stats & math, sports on February 12, 2012

*“GOOD with numbers? Fascinated by data? The sound you hear is opportunity knocking.”* This is how Steve Lohr of the *New York Times* leads off his article in today’s Sunday paper on The Age of Big Data. Certainly the abundance of data has created a big demand for people who can crunch numbers. However, I am not sure the end result will be nearly as profitable as employers may hope.

“Many bits of straw look like needles.”

– Trevor Hastie, Professor of Statistics, Stanford University, co-author of The Elements of Statistical Learning (2nd edition).

I take issue with extremely tortuous paths to complicated models based on happenstance data. This can be every bit as bad as oversimplifications such as relying on linear trend lines (re Why you should be very leery of forecasts). As I once heard DOE guru George Box say (in regard to overly complex Taguchi methodologies): Obscurity does not equal profundity.

For example, Lohr touts the replacement of earned run average (ERA) with the “Siera”—Skill-Interactive Earned Run Average. Get all the deadly details here from the inventors of this new pitching performance metric. In my opinion, baseball itself is already complicated enough (try explaining it to someone who only follows soccer) without going to such statistical extremes for assessing players.

The movie “Moneyball” being up for Academy Awards is stoking the fever for “big data.” I am afraid that in the end the call may be for “money back” after all is said and done.