Earlier this month, I posed some statistics interview questions. Here are possible answers.

1. Stirling’s formula holds that $\lim_{n\to\infty}{{\Gamma(n)e^{n}}\over{n^{n-1/2}}}=\sqrt{2\pi}$, a result with broad utility in numerical recipes (the gamma function and concentration inequalities) and complexity (the notion of log-linear growth).  It can follow directly from the central limit theorem.  How?

Suppose $X_{1},\dots,X_{n}$ are i.i.d. exponential(1).  Then $\overline{X}_{n}$ is distributed $\Gamma(n,1/n)$.  By the CLT, $\sqrt{n}\left(\overline{X}_{n}-1\right)\to N(0,1)$.

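Therefore, writing out the density of $\sqrt{n}\left(\overline{X}_{n}-1\right)$ at $t$ and comparing it with the standard normal density,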

$\lim_{n\to\infty}{{n^{n-1/2}}\over{\Gamma(n)e^{n}}}\left(t/\sqrt{n}+1\right)^{n-1}e^{-\sqrt{n}t}={{1}\over{\sqrt{2\pi}}}e^{-t^{2}/2}$

for all $t\ge0$.  Showing the result of the theorem requires recognizing that

${\left( {t / \sqrt{n}+1}\right)}^{n-1}e^{-\sqrt{n}t}=(t/\sqrt{n}+1)^{-1}\left({{(t/\sqrt{n}+1)^{\sqrt{n}}}\over{e^{t}}}\right)^{\sqrt{n}}\to e^{-t^{2}/2}$.

We’ll omit the details.
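For readers who want them, the omitted step is a Taylor expansion of the logarithm.  A brief sketch, for fixed $t$ and $n$ large enough that $|t|/\sqrt{n}<1$:

$n\log\left(1+{{t}\over{\sqrt{n}}}\right)-\sqrt{n}t=n\left({{t}\over{\sqrt{n}}}-{{t^{2}}\over{2n}}+O\left(n^{-3/2}\right)\right)-\sqrt{n}t=-{{t^{2}}\over{2}}+O\left(n^{-1/2}\right)$,

so the bracketed factor raised to the power $\sqrt{n}$ tends to $e^{-t^{2}/2}$, while $(t/\sqrt{n}+1)^{-1}\to1$.  Dividing the density limit by this one leaves ${{n^{n-1/2}}\over{\Gamma(n)e^{n}}}\to{{1}\over{\sqrt{2\pi}}}$, which is Stirling’s formula.  I’m assuming here, rather than proving, that the densities converge pointwise (a local form of the CLT), since the CLT by itself speaks only to distribution functions.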

2. Can you think of how regularization and prior distributions are connected?

Generally we can characterize the cost function as a negative log-likelihood.  For instance, the sum-of-squares error in OLS given by

$\sum_{n=1}^{N}\left(ax_{n}+b-y_{n}\right)^{2}$

can be interpreted, up to a factor of one half and an additive constant, as the negative log-likelihood of

$\mathbb{P}(Y=y|X=x;a,b)={{1}\over{\sqrt{2\pi}}}e^{-(ax+b-y)^{2}/2}$.

We can coerce a Bayesian treatment by thinking of the regression coefficients as random phenomena, so that

$\mathbb{P}(Y=y,A=a,B=b|X=x)={{1}\over{\sqrt{2\pi}}}e^{-(ax+b-y)^{2}/2}\times\Pi(a,b)$.

This prior belief about the regression coefficients can take the form of any regularization we may choose to include in the original formulation.  For instance, suppose we really believe that the slope and intercept ought not be too big.  An L2 regularization would mean

$\Pi(a,b)=Ke^{-a^2-b^2}$

for a suitable normalizing constant $K$; scaling the exponent by a factor $\lambda$ would give the analogue of the regularization hyperparameter.
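A minimal sketch of the connection, under the unit-variance Gaussian likelihood and the prior $Ke^{-a^2-b^2}$ above: the negative log-posterior is half the sum of squares plus $a^2+b^2$, so its minimizer is the ridge solution with penalty weight 2 on the sum of squares.  The data and the function names `neg_log_posterior` and `ridge_fit` are made up for illustration.

```python
# Sketch: the posterior mode under a Gaussian prior equals an L2-penalized
# least-squares fit. Data and helper names are hypothetical.

def neg_log_posterior(a, b, xs, ys):
    """Negative log of likelihood times prior, dropping additive constants."""
    half_sse = 0.5 * sum((a * x + b - y) ** 2 for x, y in zip(xs, ys))
    penalty = a ** 2 + b ** 2  # from the prior K * exp(-a^2 - b^2)
    return half_sse + penalty


def ridge_fit(xs, ys, lam):
    """Minimize sum of squares + lam * (a^2 + b^2) via the 2x2 normal equations."""
    n = len(xs)
    sxx = sum(x * x for x in xs)
    sx = sum(xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sy = sum(ys)
    # Solve [[sxx + lam, sx], [sx, n + lam]] @ (a, b) = (sxy, sy) by Cramer's rule.
    m11, m12, m22 = sxx + lam, sx, n + lam
    det = m11 * m22 - m12 * m12
    a = (sxy * m22 - m12 * sy) / det
    b = (m11 * sy - m12 * sxy) / det
    return a, b


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 7.1, 8.8]

# Half the sum of squares plus (a^2 + b^2) has the same minimizer as
# the sum of squares plus 2 * (a^2 + b^2), hence lam = 2.
a_map, b_map = ridge_fit(xs, ys, lam=2.0)
print(a_map, b_map, neg_log_posterior(a_map, b_map, xs, ys))
```

Nudging `a_map` or `b_map` in any direction should only increase `neg_log_posterior`, which is one way to convince yourself that the regularized fit and the posterior mode coincide.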

3. Where might the CLT run aground?

Answer : Any number of obstacles to invoking the CLT exist, including non-finite variance, unstable variance, lack of independence, and so on.  Specific examples include a ratio of two independent standard normal variables, ratios of exponentials, waiting times to exceed say the first measurement, and so on.
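To make the first example concrete: the ratio of two independent standard normal variables is standard Cauchy, with characteristic function $\varphi(t)=e^{-|t|}$.  A short sketch of why averaging doesn’t help:

$\varphi_{\overline{X}_{n}}(t)=\left(\varphi(t/n)\right)^{n}=\left(e^{-|t|/n}\right)^{n}=e^{-|t|}$.

That is, the sample mean of $n$ i.i.d. standard Cauchy variables is itself standard Cauchy for every $n$, so the average never concentrates and no normal limit emerges.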

4. Can you offer a variance-stabilizing statistic for predicting success probability in a binomial sample?  Provide a $100(1-\alpha)$% confidence interval.

Answer : With the delta method, we can offer the test statistic

$T=2\arcsin(\sqrt{\overline{X}})$.

By the delta method, we have

$\sqrt{N}\left(2\arcsin(\sqrt{\overline{X}})-2\arcsin(\sqrt{p})\right)\to N(0,1)$.

The confidence interval, with work, is

$\left[A-B,A+B\right]$,

with

$A=\overline{x}\cos\left({{|z_{\alpha/2}|}\over{\sqrt{N}}}\right)+\sin^{2}\left({{|z_{\alpha/2}|}\over{2\sqrt{N}}}\right)$

and

$B=\sqrt{\overline{x}}\sqrt{1-\overline{x}}\sin\left({{|z_{\alpha/2}|}\over{\sqrt{N}}}\right)$.
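The “work” amounts to inverting the arcsine transform: the interval for $\arcsin(\sqrt{p})$ is $\arcsin(\sqrt{\overline{x}})\pm|z_{\alpha/2}|/(2\sqrt{N})$, and squaring the sine of each endpoint gives $A\mp B$ by the double-angle identities.  Here is a rough sketch that computes the interval both ways and checks that they agree.  The function names are my own, and I’m assuming the endpoints stay within $[0,\pi/2]$ on the angle scale (no clipping near 0 or 1).

```python
from math import asin, cos, sin, sqrt
from statistics import NormalDist


def arcsine_interval(successes, trials, alpha=0.05):
    """Closed-form interval [A - B, A + B] from the formulas above."""
    x_bar = successes / trials
    z = abs(NormalDist().inv_cdf(alpha / 2))  # |z_{alpha/2}|
    angle = z / sqrt(trials)
    a = x_bar * cos(angle) + sin(angle / 2) ** 2
    b = sqrt(x_bar) * sqrt(1 - x_bar) * sin(angle)
    return a - b, a + b


def arcsine_interval_direct(successes, trials, alpha=0.05):
    """Same interval, obtained by inverting the transform endpoint by endpoint."""
    x_bar = successes / trials
    z = abs(NormalDist().inv_cdf(alpha / 2))
    theta = asin(sqrt(x_bar))        # point estimate on the angle scale
    delta = z / (2 * sqrt(trials))   # half-width on the angle scale
    return sin(theta - delta) ** 2, sin(theta + delta) ** 2


print(arcsine_interval(30, 100))
print(arcsine_interval_direct(30, 100))
```

The two printed intervals should match to floating-point precision; near $\overline{x}=0$ or $\overline{x}=1$ the angle-scale endpoints would need clipping, which this sketch skips.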

A candidate capable of deriving the aforementioned in an hour interview would achieve a near unconditional pass.

5. Where does maximum likelihood estimation run into trouble?  Name three problems.

Answer : (1) Peakedness of the likelihood function can cause numerical instability, (2) sometimes the optimal solution falls outside the parameter space, and (3) there may be no global optimum.

A followup question is to query examples of each case.  Simple ones are estimating the size of a binomial trial, estimating parameters in subtended distributions, and unidentifiable parameters, respectively.  Answers may vary.

6. Consider a ratio of two exponential random variables.  If your boss asked you to approximate its expectation, how would you answer?

Answer : If you got number three above, you already know the answer : the expectation does not exist.  Understanding the nuance is helpful in overcoming the challenges posed by ratio metrics.
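A sketch of the divergence, assuming (the question doesn’t say) that the two variables $X$ and $Y$ are independent exponential(1):

$\mathbb{E}\left[{{X}\over{Y}}\right]=\mathbb{E}[X]\,\mathbb{E}\left[{{1}\over{Y}}\right]=\int_{0}^{\infty}{{e^{-y}}\over{y}}dy\ge e^{-1}\int_{0}^{1}{{dy}\over{y}}=\infty$.

The integrand behaves like $1/y$ near zero, so the denominator landing close to zero, however rarely, is enough to make the expectation infinite.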

7. If $X_{1},\dots,X_{N}$ are i.i.d. unif($0,\theta$), how would you estimate $\theta$?  Give an estimator and justification.

Answer : This is an excellent opportunity to discuss sufficiency, a satisfying means of describing information necessary to determining a parameter.  It turns out that the maximum order statistic $X_{(N)}$, distributed $\theta\times B(N,1)$, is sufficient for $\theta$.  Therefore, an unbiased estimator is

$Y={{N+1}\over{N}}X_{(N)}$,

with

$\text{Var}\left(Y/\theta\right)={{1}\over{N(N+2)}}$.

We can invoke Lehmann-Scheffe to claim our estimator is UMVUE, if we can show completeness, another convenient statistical property we’ll discuss more in the days ahead.  Offering a confidence interval is an interesting follow-up.
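A quick simulation is one way to sanity-check the unbiasedness and the variance claim.  This is only a sketch: $\theta=3$, $N=10$, the seed, and the function name `simulate_scaled_max` are arbitrary choices of mine.

```python
import random
from statistics import fmean, pvariance


def simulate_scaled_max(theta, n, trials, seed=0):
    """Draw `trials` samples of size n from unif(0, theta); return (n+1)/n * max for each."""
    rng = random.Random(seed)  # seeded so reruns give the same draws
    estimates = []
    for _ in range(trials):
        largest = max(rng.uniform(0, theta) for _ in range(n))
        estimates.append((n + 1) / n * largest)
    return estimates


theta, n = 3.0, 10
ys = simulate_scaled_max(theta, n, trials=20000)
print(fmean(ys))                    # should land near theta
print(pvariance(ys) / theta ** 2)   # should land near 1 / (n * (n + 2))
print(1 / (n * (n + 2)))
```

Swapping in twice the sample mean, the method-of-moments estimator, makes for a useful comparison: its variance of $\theta^{2}/(3N)$ shrinks only like $1/N$ rather than $1/N^{2}$.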

Much of the above comes from insights in Statistical Inference by Casella and Berger.  I’ll be interviewing Roger Berger in a few months for Algo-Stats.  If you’ve made it this far in my article, please reach out to me to chat.  npslagle

# One Hundred Statistics Inequalities

Six years ago, I sat in a randomized algorithms class taught by Dick Lipton, and he requested we students assemble a list of concentration inequalities.  Perfectionistically, I scoured textbooks, paper articles, and the internet for every last inequality I could unearth, building a respectable assortment of one hundred results of varying utility and import.  Dick had some design in mind on the assignment, though I never was able to determine his intentions, as he is famously scattered and hard to track down.

I posted them on my personal website years ago, and noticed lately that they’d found their way into lectures by computer scientist Junzhou Huang, a professor at one of my alma maters, the University of Texas at Arlington.  It occurred to me that they might interest a broader audience.

As time permits, I’ll try to expand them, and perhaps produce a workbook of proofs for many of them.

Take a look at them here, and let me know what you think!

# On the Responsibility of Technologists : A Prologue and Primer

A special thank you to S. Kelly Gupta for invaluable suggestions, and to George Polisner and Noam Chomsky for taking the time to read an earlier draft and offer encouraging feedback.

## A Casting Call for the Conscientious Data Practitioner

For some time now, I’ve planned on writing an article about the very serious risks posed by my trade of choice, data science.  And with each passing day, new mishaps, events, and pratfalls delay publishing, as the story evolves even as I write this.  For instance,  Mark Zuckerberg testified before the Senate Judiciary Committee this week, sporting a smart suit and a booster seat ostensibly to improve morale.  Though some interesting topics came up, the discussion was routine, with the requisite fear-mongering from Ted Cruz, the bumbling Orrin Hatch asking how money comes from free things (apparently he forgot to ask Trump about withholding pay from blue-collar contractors), and a few more serious people asking about Cambridge Analytica, such as Kamala Harris querying the lengthy delay in Facebook notifying users of Cambridge, and, surprisingly, John Kennedy panning Facebook’s user agreement as “CYA” nonsense.

The tired, public relations newspeak of the mythical well-meaning, self-regulating corporations accompanies happily the vague acknowledgements of responsibility around certain things we heard from Zuckerberg, along with references to proprietary and thus unknowable strategies almost in place.  And though I doubt Congress in its current state can impose any reasonable regulations, nor would those in charge be capable of formulating anything short of a lobbyist’s Christmas list, my intention here is to argue for something more substantial : a dialog must begin among technologists, particularly data practitioners, about the proper role of the constructs we wield, as those constructs are powerful and dangerous.  And it isn’t just because a Russian oligarch might want Donald Trump to be president, or because financial institutions happily risk economic collapse at the opportunity to make a few bucks; data has the power to confer near omnipotence on the state, generate rapid, vast capital for a narrow few at the expense of the many, and provide a scientifically-sanctioned cudgel to pound the impoverished and the vulnerable.  Malignant actors persist and abound, but complacency among the vast cadre of well-intentioned technologists reminds me of Martin Luther King, Jr.’s discussion of the “white moderate who is more devoted to ‘order’ than to justice.”  So I must clarify that I’m writing not to the bad people who already understand quite well the stakes, but to my fellow conscientious practitioners, particularly those among us who fear consequences to career or suffer under the peculiar delusion that we have no power.  Consequences are real, but we as technologists wield great power, and that power is more than additive when we work together.  The United States is unusually free, perhaps in the whole of human history, in that we can freely express almost any idea with little or no legal ramification.  Let’s use that freedom together.

## A Lasting Legacy : Power and Responsibility

Fifty-one years ago last February, Noam Chomsky authored a prescient manifesto admonishing his fellow intellectuals to wield the might and freedom they enjoy to expose misdeeds and lies of the state.  Much of his discussion dwells on the flagrant dishonesty of particular actors as their public pronouncements evolved throughout the heinous crime that is the Vietnam War, and his more recent discussions, such as those appearing in Boston Review in 2011, describe the significant divide between intellectuals stumping for statism and the occasional Eugene Debs, Rosa Luxemburg, or Bertrand Russell:

The question resonates through
the ages, in one or another
form, and today offers a
framework for determining the
“responsibility of intellectuals.”
The phrase is ambiguous: does it
refer to intellectuals’ moral
responsibility as decent human
beings in a position to use their
privilege and status to advance
the causes of freedom, justice,
mercy, peace, and other such
sentimental concerns? Or does it
refer to the role they are expected
to play, serving, not derogating,
leadership and established institutions?

We technologists, a flavor of intellectuals, have ascended within existing institutions rapidly, for fairly obvious reasons.  More specifically, those of us in data science are enjoying a bonanza of opportunities, as institutions readily hire us in record numbers to sort out their data needs, uniformly across the public, private, good, bad, large, small dimensions.  We’re inheriting remarkable power and authority, and we ought approach it with respect and conscience.  Data, though profoundly beneficial and dangerous, is still just a tool whose moral value is something we as its priesthood, if you will, can and ought determine.  Chomsky’s example succinctly captures how we should view it :

Technology is basically neutral.
It's kind of like a hammer.
The hammer doesn't care whether
you use it to build a house or
crush somebody's skull.

We can ascribe more nuance, with mixed results.

## Data is Good? Evidence Abounds

I suspect I’m preaching to the choir if I remark on the impressive array of accomplishments made possible by data and corresponding analyses.  I believe the successes are immense and plentiful, and little investigative rigor is necessary here in the world of high tech to note how our lives are bettered by information technology.  Woven throughout the many successes, more subtly to the untrained eye than I or similar purists would prefer, is statistics, and the ensuing sexy taxonomy of machine learning, big data, analytics, and myriad other newfangled neologisms.  The study of random phenomena has made much of this possible, and I’d invite eager readers to take a look at C.R. Rao’s survey of such studies in Statistics and Truth.

I’m in this trade because I love it, I love science, I love technology, I love what it can do for you and me, and I’m in a fantastic toyland which I never want to leave.  So I must be very clear that I am no Luddite, nor would I advocate, except in narrow cases (see below), technological regression; the universal utility of much of what has emerged from human ingenuity has served to lengthen my life, afford me time to do the work I want, and make me comfortable.  Though the utility is so far very unevenly shared, I do believe we’ve made tremendous progress, and the potential is limitless.  So I’d entreat the reader potentially resistant to these ideas to brandish Coleridge’s “willing suspension of disbelief,” then judge for oneself.  My primary objective here is to begin a dialog.  Now for some of the hard stuff.

## Data is Bad? There is Evil, and There Are Malignant Actors

Evils of technology also are innumerable, as the very large, growing contingent of victims of drone attacks, guns, bombs, nuclear attacks and accidents, war in general, and so on, will attest.  Surveying the risks of technology leaves the current scope long behind, but it’s worth paying attention to the malignant consequences of runaway technology.  I’ll be reviewing Daniel Ellsberg’s The Doomsday Machine on my other blog soon; suffice it to say the book is good, the story is awful.  The book is a sobering, meticulous analysis of the most dangerous technology ever created, and how reckless and stupid planners were in safeguarding said technology.  Here, we’ll stick just to problems arising from bad data science, and the bad actors, be they ideologues, the avaricious, the careless, or the malevolent.

We ought consider momentarily the current state of affairs : Taylor Armerding of CSO compiled the greatest breaches of the current century, attempting to quantify the damage done in each case.  Since the publication of his summary, the Cambridge Analytica / Facebook scandal has emerged, sketching a broad “psychographic” campaign to manipulate users into surrendering priceless data and fomenting discord.  Quite dramatically, a 2016 memo leaked from within Facebook shows executive Andrew Bosworth quipping,

In other words, “don’t bother washing the blood off your money as you give it to us.”  Slate offers an interesting indictment of the business model that has rendered the exigencies of data theft, content pollution, and societal discord concrete, imminent contingencies.  And most recently, Forbes reports that an LGBT dating app called Grindr apparently permits backdoor acquisition of highly sensitive user data, endangering users and betraying their physical location.  And the first reported fatality due to driverless technology deployed by Uber occurred in Arizona this month, generating a frenzy of concerns around the safety and appropriateness of committing these vehicles to the public transportation grid.  The reaction I noted on the one social media platform I use, LinkedIn, was tepid, ranging from despairing emoticons to flagrant, arrogant pronouncements that this is the cost of the technology.  I also observed a peculiar response to those unhappy about the lack of security around user data : blame the victims.  The responses vary from the above declaration of cost of convenience to disdain for the lowly users in need of rescue from boredom, discussed by one employee of Gartner, a research firm :

let's be honest about
one thing: we all agree that
we give up a significant part
of our privacy when we decide
to create an account on Facebook[;]
[w]e exchange a part of our private
life for a free application that
prevents us from being bored most
time of the day.

I’d refer this person to Bosworth’s memorandum, though he, like CNN in 2010, likely hadn’t seen it before venturing such drivel.  I interpreted their argument as a public relations vanguard aimed at corporate indemnification.  Certainly, an alarming number of terms and conditions agreements aim to curtail class action lawsuits and, where legal, eliminate all redress through the court system.  On its face, this sounds ludicrous, as the court system is precisely the public apparatus for resolving civil disputes.  Arbitration somehow is a thing, with Heritage and concentrations of private power reliably defending it as freer than the public infrastructure over which citizens exercise some control, however meager.   Sheer genius is necessary to read

[n]o one is forced into arbitration[;]
[t]o begin with, arbitration is not
“forced” on consumers[...] [a]n obvious
point is that “no one forces an
individual to sign a contract[,]”

and interpret it any other way than that the freedom to live without technology is a desirable, or even plausible arrangement; Captain Fantastic, anyone?

Maybe it’s a question of volume, as catechismic, shrill chanting that we have no privacy eventually compels educated people to write the utter nonsense above.  If one were to advance the argument further, it’s akin to blaming the victims of the engineering flaws in Ford’s Pinto; after all, the car rescues the lower strata of society from having to walk or taxi everywhere they want to go, and death by known engineering flaws is the cost of doing business.  The arrogance evokes Project SCUM, the internal designation for a marketing campaign tobacco giant Camel aimed at gays and the homeless in San Francisco in the 1990s.

Governments cause even greater harm, exhibited in Edward Snowden’s whistleblowing on the NSA’s pet project to spy on you and me, code-named PRISM.  Comparably disconcerting, Science Alert reported this week that the development of drone technology leaving target acquisition in the control of artificial intelligence is almost complete, meaning drones can murder people using inscrutable and ultimately unaccountable data models.  State-of-the-art robotic vision mistakes dogs for blueberry muffins in anywhere from one to ten percent of static images analyzed, depending on the neural network model, meaning a drone hunting muffins would destroy the one to ten percent of dogs it mistakes for them, and this is with training on static imagery!  Imagine the difficulties in a dynamic field-of-view with exceedingly narrow time windows necessary to overcome errors.  Human-controlled drones already represent enormous controversy, operating largely in secret without legislative or judicial review under the direction of the executive branch of the American government.  Who must answer for a runaway fleet of drones?  What if they’re hijacked?

More locally, The Guardian recently unmasked the racist facial recognition models deployed by law enforcement agencies, bemoaning the existence of “unregulated algorithms.”  I’d wager the capability to reverse-engineer a machine learning model to steal private data receives great attention among adversarial actors and private corporations.  I can remember in my first job many years ago being in a discussion over an accidental leak of a few lines of FORTRAN to a subcontractor, to which I naively queried, “Why are we in business with someone we think would steal from us?”  A manager calmly replied that anyone and everyone would steal, and in any way they can.  Maybe it’s true, but I’d like to believe there’s more to countervailing passive resistance than meets the eye.  In any case, data science and artificial intelligence are tools co-opted for sinister and dangerous purposes, and we ought try to remember that.

## Data is Ugly? Errors and Injustice, Manned and Unmanned

Data needs no bad actor or vicious intent to be misleading.  Rao refers to numerous unintentional examples of data misuse within the scientific record, peppered throughout the works of luminaries such as Gregor Mendel, Isaac Newton, Galileo Galilei, John Dalton, and Robert Millikan, as documented by geneticist J.B.S. Haldane and Broad and Wade’s Betrayers of the Truth.  For instance, the precision Newton provided for the gravitational constant is well beyond his capacity to measure, and Mendel’s recorded data agree with his genetic models so closely that chance alone would produce such a fit only with astronomically small probability, suggesting either transcription errors or blatant cherrypicking.  Rao notes

[w]hen a scientist was
convinced of his theory,
there was a temptation to
look for "facts" or distort
facts to fit the theory[; t]he
concept of agreement with theory
within acceptable margins of
error did not exist until the
statistical methodology of
testing of hypotheses was
developed.

That is, statistical illiteracy can only compound the problem of “fixing intelligence and facts around the policy,” to paraphrase the infamous Downing Street Memo.

Statistical literacy doesn’t guarantee good outcomes, even with honest representation.  Data can reinforce wretched social outcomes by identifying the results of similar failed policies of the past.  For instance, everyone knows African Americans are more likely to be harassed by police.  Thus, they’re more likely to be arrested, indicted, charged, and convicted of crimes.  Machine learning algorithms identify outcomes and race as significantly interdependent, and new policy dictates that police should carefully monitor these same people.   Asking why we ought trust an inscrutable model is unmentionable, reminding me that earlier propagandists invoked the “will of God” as justification for slavery, and later, the “free market” requires that some people be so poor that they starve.  Maybe elites always require some ethereal reason for the suffering we permit to pass in silence.  Anecdotally on racism, a myopic cohort once pronounced triumphantly to me that racists aren’t basing their prejudice on skin color, but on other features correlated with skin color.  The Ouroboros, or some idiotic variant, comes to mind.

## Weapons of Math Destruction : Destructive Models

Cathy O’Neil in Weapons of Math Destruction (WMDs) ponders such undesirable social outcomes of big data crippling the poor and the disadvantaged.  Within the trade, dumb money describes the proceeds mined and fleeced from vulnerable populations.  The money poor people have ranges from real estate to be reverse-mortgaged, to poverty and veteran status to leverage for education grants and loans, to desperation itself in the form of title loans, payday loans, and other highly destructive financial arrangements.  Examples of startups and firms abound, from for-profit online education firms like Vatterott and Corinthian Colleges, which target veterans and the poor to cash in on student loans, to their enabling advertising firms such as Neutron Interactive, which post fake job ads to cull poor people’s phone numbers and blast them with exaggerated ads.  Thinktank Learning and similar firms model student success, helping universities and colleges game the U.S. News and World Report ranking system, a perfect example of a WMD.  CompStat and HunchLab help resource-starved police departments profile citizens based on geography, mixing nuisance crimes with the more violent variant and strengthening racial stereotypes.  Courts now rely on opaque models to assess risk of convicts, determining sentences accordingly, according to a piece in Wired last year.  Ought we understand the reasons why two criminals convicted of the same crime receive different sentences?  The book is very much worth a read.  Her own journey is revealing, as she was an analyst at D.E. Shaw around the time of the market crash.

Data has accumulated over the years that ETS’s prized Graduate Record Examination (GRE), a test required for candidacy in most American graduate programs,

• has disproportionately favored the white, the rich, and the male, (sounds like a WASP daytime drama),
• may not be all that useful for prediction, and
• operates in darkness, inscrutably like many such “psycho-social” metrics.

My own personal experience with the examination is kind of interesting and comical : I’m apparently incapable of writing.  Being a southpaw, my penmanship is atrocious, but I seem to remember having typed the essay… Kidding aside, acquiring feedback from them was impossible, and they led me to believe that the essay receives grades via an electronic proofreader.  I guess no one remained who could interpret the algorithm’s outputs.

A more serious question O’Neil raises is that machine learning models suffer many of the same biases and preferences borne by their architects; I think of ETS reinforcing malignant stereotypes, a kind of “graduate ethnic cleansing.”  Algorithms running for Title Max target the poor, making them poorer still.  More seriously, what are these models trying to optimize, and is it desirable behavior?

## The Problem of Proxies

O’Neil offers that part of the problem with building opaque data models to inform real world decisions is that the real world objective we’d like to improve is poorly proxied: unsuitable substitutes seem to be hogging the constraints.  For instance, how can an algorithm quantify whether a person is happy?  Happiness is something we all seem to understand (or think we do), and we can generally spot it or its shaded counterpart with little effort.  Millions of years have chiseled, then kneaded the gentle ridges of the prefrontal cortex to lasting import.  Algorithms might read any number of interesting features, and unlike consciousness itself, I suspect happiness, or at least its biological underpinnings, is something an algorithm could predict, but any definition suffers limitations.  My earliest intuitions in mathematics led me to believe that any state can be reproduced with sufficient insight into the operating principles.  Though the academy has largely reinforced what I used to call the “dice theory” (and I was all-too-proud to have dreamed it up myself), Galileo lamented centuries ago, as have others more recently, including Hume, Bertrand, and Chomsky, that the mechanical philosophy simply isn’t tenable.  More narrowly, we may be incapable, as we are now, of effectively proxying very important soft science social metrics.  I believe misunderstanding this may be fueling the insatiable appetite of start-up funding for applications lengthening prison sentences, undercutting college applicants, burdening teachers with arbitrary, easily falsified standards, bankrupting the poor, and harassing and profiling the most vulnerable.  Is society better off with young black men fearing to walk the street at night with the justified concern of being murdered?

A striking example of poor proxying is invoking the stock market as the barometer of the economy.  And this is something I see in social media time and time again.  Missing from the euphoria is that for nearly fifty years, the Gini index has been positively correlated with the S&P 500, the former measuring economic inequality and the latter indexing the “health” of the stock market.  That is, as the stock market becomes healthier, the distribution of the money supply drifts away from the uniform.  Not coincidentally, this behavior seems to begin right around the Nixon shock, or the deregulation of finance and the dismantling of Bretton Woods.  In his 2006 book The Conservative Nanny State, economist Dean Baker discusses “perverse incentives” in maximizing incorrect proxies in patent trolling, wasteful copycat drug development, and the like.  The U.S. Constitution guarantees copyright protection to promote development of science, a purpose contravened by wasting sixty percent of research and development money on marketing and replicated research.

Even in a seemingly more innocuous setting, say social media, we see deep problems in proxies.  Shares and likes become the currency of interaction, and social desirability need not interfere for most.  I’ve noticed in my own experiences in writing comments online that a frenetic vigilance overcomes me if I feel I’ve been misunderstood or have given the wrong sort of offense, as I’m (perhaps pathologically) hardwired to care about the feelings of others.  By interacting online rather than in-person, a host of nonverbal cues and information are absent, forcing us to rely on very weak proxies.  Psychology Today touched on this in 2014, and I suspect the growing body of evidence that flitting, vapid interactions online are damaging social intelligence demonstrates that the atomization of American culture is in no way served by social media.

Admittedly, the story seems dire, but belying the deafening silence is a groundswell of conscientious practitioners, fragmented and diffuse, but pervasive and circumspect.

## The Courage to Speak

When I discuss any of the above with cohorts privately, a very large fraction agree on the dangers of misusing this technology; equally reflexive, though incorrect, is a habituated resignation, especially in America where illusory impotence reigns supreme.  And so I see very little in the way of commentary on these issues from tradespersons themselves, though a handful from my network are reliable in discussing controversy.  Perhaps the psychology is simpler : is it fear of blowback and risks to career of the kind Eugene Gu is experiencing with Vanderbilt?  Certainly even popular athletes face blacklisting, Colin Kaepernick being an exemplar.  Speaking out is risky, but silence strengthens what Chomsky calls “institutional stupidity,” which some of the above quotes embody.

The point I’m trying to drive home is that the responsibility of we the technologists demands an end to controversy aversion; we simply MUST begin talking about what we do.  Make no mistake, the ensuing void of silence emboldens demagoguery in malignant actors, such as the aforementioned projections on unmanned, computer-controlled drone warfare, further deterioration of the criminal justice system, exploitation of the poor and vulnerable, and wrecking the global economic system.  Further, refusing to speak out assures a platform for desperately irresponsible, dangerous responses of blaming or ridiculing the victims, a sort of grinding salt in the wounds.  Consider the extreme variant of the latter : Rick Santorum, Republican brain trust, has sagely admonished school shooting survivors to learn CPR rather than protest and organize to demand safety, and Laura Ingraham, shrill, imbecilic Fox host, has gleefully tweeted juvenile insults at one of the outspoken survivors.  Why would we relegate damage done by runaway data science as the cost of doing business, if we can clearly perceive the elitism and cynicism in the above?  Silence may seem safe, but is it really?  Ignoring sharpening income inequality, skyrocketing incarceration rates, and stratification and segregation has a cost : Trumps of the world become leaders, the downtrodden looking to demagogues.

## The Coming Storm Following the Dream

With each public relations disaster and each discovery of flagrant disregard for users and their precious private data, we hurtle toward what I believe is an inevitable series of lawsuits and criminal investigations leading to public policy we ought to help direct.  C.R. Rao wrote some years ago regarding a lawsuit against the government for failing to act to save fishermen from a predictable typhoon, plaintiffs’ chief issue being that the coast guard failed to repair a broken buoy :

[s]uch instances will be rare,
but none-the-less may discourage
statistical consultants from
venturing into new or more
challenging areas and restrict
the expansion of statistics.
[emphasis mine]

The General Data Protection Regulation (GDPR), adopted by the European Union, is perhaps one of the broadest frameworks ratified by any national or supranational body.  This coming May, the framework will supersede the Data Protection Directive of 1995.  The US government has regulated privacy and data with respect to education since 1974 with FERPA and medicine since 1996 with HIPAA.  Yet court precedent hasn’t yet determined the interpretation of these acts with respect to machine learning models built on sensitive data.  What will an American variant of GDPR look like?  Practitioners ought have a say, and the more included in the discussion, the better the outcome.  But this sort of direction requires coordination, and because of the unique and difficult work we do, we are fractured from one another and more susceptible to dogmatism around the misnamed American brand of libertarianism.  The American dream is available to technologists (and almost no one else), whence a rigidity of certain non-collectivist values, enumerated in a study conducted by Thomas Corley for Business Insider : the rub is that wealthy people believe very strongly in self-determination, and assume they are responsible for their good fortune.  I think of it as the “I like the game when I’m winning” phenomenon, and like most deep beliefs, some kernel of truth is there.  We could spend considerable time just debating these difficulties, and my being married to a psychiatrist offers uncomfortable insight.  In any case, discussions surrounding this are ubiquitous, and my opinions, though somewhat unconventional, are straightforward.  Historically, collective stands are easier to make and less risky than those made alone.  In semi-skilled and clerical trades, we called these collections “unions.”  Professional societies such as the AMA, the ASA, the IEEE, and so on, are the periwinkle-to-white collar approximations, with the important similarity that collectively asserting will just simply works better.  And yet, we in data science have little in the way of such a framework.  It’s worth understanding why.

## Cosmic Demand Sans Trade Union

The skyrocketing demand for new data science and machine learning technology, together with a labor dogmatism peculiar to the United States, has left us, so it would seem, without a specific trade union that is independent of corporations and responsible for governing trade ethics and articulating public policy initiatives.  Older technology trades have something approximating a union in professional societies such as the IEEE and the American Statistical Association; like the American Medical Association and the American Psychological Association, these agencies offer codes of ethical practice and publications detailing the latest comings and goings in government regulation, technology, and the like.  Certainly, the discussion occurs here and there, though Steve Lohr’s 2013 piece in the New York Times summarizing a panel discussion at Columbia hinted at a common refrain in our trade:

[t]he privacy and surveillance
perils of Big Data came up only
in passing[...] during a
one panel, Ben Fried,
officer, expressed a misgiving[:]
“[m]y concern is that the
technology is way ahead of society[.]”

That is, we all know we have a problem, but little is happening in the way of addressing it.  A smattering of public symposia have emerged on certain moral considerations around artificial intelligence, though much of what is easily unearthed consists of older articulations by Ray Kurzweil and Vernor Vinge, and older still, those by Isaac Asimov.  These often take the form of dystopian prognostications of robot intelligence, though I agree with Chomsky that we’re perhaps light years away from understanding even the basic elements of human cognition, and that replicating anything resembling it is not on the horizon.  Admittedly, my starry-eyed interest in Kurzweil’s projected singularity is what pulled me into computer science, but Emerson warns us that intellectual inflexibility belongs to small minds.  Fear-mongering about the future brings me to a spirit we ought exorcise early and often.

## Unemployment and Automation : A New(ish) Bogeyman

No discussion of the impact of our technology would be complete without paying a little attention to the fevered musings and catastrophization of mass unemployment due to automation.  We as a society of technologists ought have a simple answer to this, namely that the post-industrial revolution mindset of compulsory employment as monetized by imagined market forces is illogical, inefficient, and unnecessarily dangerous to who we are and what we do.  Even less charitably, slavish genuflection to the free market mania is an obstacle, rather than a catalyst, to progress, as the complexities of civilization necessitate a more nuanced economic framework.  Though we’d need another article or so for better justification for the foregoing, I’ll skip to the conclusion to say that we must restore and strengthen public investment in technology democratically and transparently, casting off militarization and secrecy.  A good starting place is the realization that virtually all high tech began in the public sector, and that’s a model that serves both society and technologists.  It also organically nurtures trade consortia of the variety described above.  In any case, the principal existential threats we face have nothing to do with mass employment, though thwarting those threats, nuclear proliferation and catastrophic climate change, might require it.

## Triage and Final Thoughts

Answering these current events demands responsible, courageous public discourse, appropriately supporting victims and formulating strategies to avert the totally preventable disasters above.  We should organize a professional society free of corporate, and initially governmental, interference, composed of statisticians, analysts, machine learning scientists, data scientists, artificial intelligence scientists, and so on, so that we can, internally and by conference,

• collectively educate ourselves about the ramifications of our work, for instance by reading trade specialists such as O’Neil,
• jointly draft position papers on requests for technical opinions by government and supranational organizations, such as a recent request from NIH,
• dialog openly about corporate malfeasance,
• draft articles scientifically explaining how best to regulate our work to safeguard and empower the public (eloquently stated in Satya’s mission statement),
• collectively sketch safe, sensible guidelines around implementations of pie-in-the-sky technology (such as self-driving cars), and
• strategize how to redress public harm when it happens.

A few technologists, such as George Polisner, have very publicly taken stands against executive docility with respect to the Trump administration; his building of the social media platform civ.works is a great step in evangelizing elite activism, and, of course, privacy guarantees no data company will offer.  Admittedly, we all need not necessarily surrender positions in industry in order to address controversy, but we can and must talk to each other.  Talk to human beings affected by our work.  Talk to our neighbors.  Talk to our opponents.  The ugly legal and political fallout awaiting us is really just a hapless vanguard of the much more dangerous elite cynicism and complacency.  How do we ready ourselves for tomorrow’s challenges?  It begins with a dialog, today.

# A Few Fun Statistics Interview Questions

Much of what we do in statistics requires a deeper understanding than running a package in R or Python, though those skills can’t hurt.  Testing for statistical literacy can be a bit tricky, as scientists often fall into one of two camps : either statistics is solved and thus not important enough to cultivate as a skill, or it’s completely opaque and perhaps uninteresting.

Conditioning on my own preferential treatment of statistics, I’d wager very few data scientists could answer the following questions.  We’ll defer providing sources to avoid giving up the answers.  If you’re interested in playing along, resist the temptation to search for answers online.  Think about how you would approach each of these with nothing other than pencil and paper (if those archaisms still exist).

1. Stirling’s formula holds that $\lim_{n\to\infty}{{\Gamma(n)e^{n}}\over{n^{n-1/2}}}=\sqrt{2\pi}$ , a result with broad utility in numerical recipes (the gamma function and concentration inequalities) and complexity (the notion of log-linear growth.)  It can follow directly from the central limit theorem.  How?
2. Can you think of how regularization and prior distributions are connected?
3. Where might the CLT run aground?
4. Can you offer a variance-stabilizing statistic for predicting success probability in a binomial sample?  Provide a $100(1-\alpha)$% confidence interval.
5. Where does maximum likelihood estimation run into trouble?  Name three problems.
6. Consider a ratio of two exponential random variables.  If your boss asked you to approximate its expectation, how would you answer?
7. If $X_{1},\dots,X_{N}$ are i.i.d. unif($0,\theta$), how would you estimate $\theta$?  Give an estimator and justification.
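
For readers playing along, the limit stated in question 1 can at least be checked numerically before attempting the argument.  The following is a minimal sketch in Python; the helper name `stirling_ratio` is hypothetical, and the computation is done in log space (via the standard library's `math.lgamma`) only to avoid overflow.  It illustrates the claim and says nothing about how the central limit theorem delivers it.

```python
import math


def stirling_ratio(n: int) -> float:
    """Return Gamma(n) * e^n / n^(n - 1/2), computed in log space.

    Hypothetical helper for illustration: math.lgamma gives log Gamma(n),
    so the whole ratio is exponentiated only once at the end.
    """
    log_ratio = math.lgamma(n) + n - (n - 0.5) * math.log(n)
    return math.exp(log_ratio)


if __name__ == "__main__":
    target = math.sqrt(2 * math.pi)
    # Print the ratio for growing n alongside the claimed limit sqrt(2*pi).
    for n in (1, 10, 100, 1_000, 10_000):
        ratio = stirling_ratio(n)
        print(f"n={n:>6}  ratio={ratio:.6f}  gap={ratio - target:.2e}")
```

The gap between the ratio and $\sqrt{2\pi}$ should shrink as $n$ grows, which is consistent with the limit but of course proves nothing.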

# Welcome to Algo-Stats

Recent events in industry have heralded an avalanche of interest in all things data.  Stakeholders public, private, and everything in between are racing to cash in on the tsunami of freshly collected data, and companies, government agencies, and a litany of others are clamoring and scraping for more expertise in the nascent field of machine learning and its proper forefather discipline, statistics.  Though predictions vary (McKinsey Global predicts a demand of 2.9 million jobs requiring data analytic skills this year, and Forbes reports a 650% increase in data science positions appearing on LinkedIn in the last few years), the evidence is overwhelming that demand is skyrocketing and talent is scarce.  For those of us already in the field, it’s very good news indeed.

I’ve noticed in particular a peculiar proliferation of data programs, Udacity and Coursera-style mini-courses designed to generate more and more data scientists, and a surge of LinkedIn content geared toward conversational data science and mutual-congratulatory reverie.  Connections of mine suddenly are brandishing their shiny-new course certifications, ready and able to dive into a sea of messy, unwieldy data to mine for the sparse nugget of value.  Their stories are interesting.

As the data science fever has raged upward and onward, I’m increasingly cognizant of something truly unique, a convergence of public and private interest in what automation, data, and the science behind it can do.  Those of us in this space are uniquely situated to mentor and raise up the next generation of scientists in artificial intelligence.  And so I come to the rationale for this blog.  Advocacy and mentoring are important objectives for me, as those of you who’ve read my political blog know.  I’ve also recently weathered a health crisis locking me face-to-face with mortality, so I have a heightened sense of urgency around accomplishing my key objectives.  Further, the kind of data science I do is unique, even within the trade, as I enjoy dusting off and leveraging techniques from statistics lost in the excitement of machine learning, and all that goes with it.  My aim here is to tell a story, teach some concepts, and share with data scientists and enthusiasts alike discussions with authors, experts, social scientists, and many others.

Welcome to Algo-Stats!

NP Slagle