🎯 Practical: Revolutions: current events

current events

October 02, 2017

Comparing assault death rates in the US to other advanced democracies

In an effort to provide context to the frequent mass shootings in the United States, Kieran Healy (Associate Professor of Sociology at Duke University) created this updated chart comparing assault death rates in the US to those of 23 other advanced democracies. The chart shows the rate (per 100,000 citizens) of death caused by assaults (stabbings, gunshots, etc. by a third party). Assaults are used rather than gun deaths specifically, as that's the only statistic for which readily comparable data is available. The data come from the OECD Health Status Database through 2015, the most recent complete year available.

The goal of this chart is to "set the U.S. in some kind of longitudinal context with broadly comparable countries", and to that end OECD countries Estonia and Mexico are not included. (Estonia suffered a spike of violence in the mid-90's, and Mexico has been embroiled in drug violence for decades. See the chart with Estonia and Mexico included here.) Healy provides a helpful FAQ justifying this decision and other issues related to the data and their presentation.

Healy used the R language (and, specifically, the ggplot2 graphics package) to create this chart, and the source code is available on GitHub.

For more context around this chart follow the link below, and also his prior renderings and commentary related to the same data through 2013 and through 2010.

Kieran Healy: Assault deaths to 2015

Posted by David Smith at 14:28 in current events, graphics, R | Permalink | Comments (1)

September 19, 2017

Hurricane Irma's rains, visualized with R

The USGS has followed up their visualization of Hurricane Harvey rainfalls with an updated version of the animation, this time showing the rain and flooding from Hurricane Irma in Florida:

Another #rstats #dataviz! Precip and #flooding from #HurricaneIrma 💧 #opensource code: https://t.co/rpocPQe7zR #openscience pic.twitter.com/rGX1SNiYEM
— USGS R Community (@USGS_R) September 15, 2017

This iteration improves the Harvey version by displaying rainfall in a fine grid over the state rather than on a county-by-county basis. Once again you can find the R code and data on Github, and an interactive version of the chart is available at the link below.

USGS Vizlab: Hurricane Irma's Water Footprint

Posted by David Smith at 06:51 in current events, government, graphics, R | Permalink | Comments (0)

September 12, 2017

September 08, 2017

Hurricane Harvey's rains, visualized in R by USGS

On August 26 Hurricane Harvey became the largest hurricane to make landfall in the United States in over 20 years. (That record may yet be broken by Irma, now bearing down on the Florida peninsula.) Harvey's rains brought major flooding to Houston and other coastal areas in the Gulf of Mexico. You can see the rainfall generated by Harvey across Texas and Louisiana in this animation from the US Geological Survey of county-by-county precipitation as the storm makes its way across land.

Watch #Harvey move thru SE #Texas spiking rainfall rates in each county (blue colors) Interactive version: https://t.co/qFrnyq3Sbm pic.twitter.com/md0hiUs9Bb
— USGS Coastal Change (@USGSCoastChange) September 6, 2017

Interestingly, the heaviest rains appear to fall somewhat away from the eye of the storm, marked in orange on the map. The animation features Harvey's geographic track, along with a choropleth of hourly rainfall totals and a joyplot of river gage flow rates, and was created using the R language. You can find the data and R code on GitHub; the code makes use of the USGS's own vizlab package, which facilitates the rapid assembly of web-ready visualizations.

You can find more information about the animation, including the web-based interactive version, at the link below.

USGS Vizlab: Hurricane Harvey's Water Footprint (via Laura DeCicco)

Posted by David Smith at 08:36 in applications, current events, graphics, R | Permalink | Comments (1)

August 15, 2017

Buzzfeed trains an AI to find spy planes

Last year, Buzzfeed broke the story that US law enforcement agencies were using small aircraft to observe points of interest in US cities, thanks to analysis of public flight-records data. With the data journalism team no doubt realizing that the Flightradar24 data set hosted many more stories of public interest, the challenge lay in separating routine, day-to-day aircraft traffic from the more unusual, covert activities.

So they trained an artificial intelligence model to identify unusual flight paths in the data. The model, implemented in the R programming language, applies a random forest algorithm to identify flight patterns similar to those of covert aircraft identified in their earlier "Spies in the Skies" story. When that model was applied to the almost 20,000 flights in the Flightradar24 dataset, about 69 planes were flagged as possible surveillance aircraft. Several of those were false positives, but further journalistic inquiry into the provenance of the registrations led to several interesting stories.
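To give a feel for the approach, here is a minimal sketch of a random forest classifier in R. It is not BuzzFeed's code: it assumes the randomForest package is installed, the feature names (turn_rate, altitude_ft, duration_hr) are hypothetical, and the "flights" are simulated placeholders rather than real flight records.

```r
library(randomForest)  # assumed to be installed from CRAN

set.seed(42)
n <- 200

# Simulated placeholder data: known surveillance planes circle at low altitude,
# routine traffic flies high and mostly straight (hypothetical features)
surveil <- data.frame(turn_rate   = rnorm(n, 8, 2),
                      altitude_ft = rnorm(n, 5000, 1000),
                      duration_hr = rnorm(n, 3, 0.5))
routine <- data.frame(turn_rate   = rnorm(n, 1, 1),
                      altitude_ft = rnorm(n, 30000, 5000),
                      duration_hr = rnorm(n, 2, 1))
train <- rbind(surveil, routine)
label <- factor(rep(c("surveil", "other"), each = n))

# Fit the forest on the labelled planes
rf <- randomForest(x = train, y = label, ntree = 200)

# Score two unlabelled planes: one circling low, one cruising high
new_planes <- data.frame(turn_rate   = c(7.5, 0.5),
                         altitude_ft = c(5500, 32000),
                         duration_hr = c(3.2, 1.5))
score <- predict(rf, new_planes, type = "prob")[, "surveil"]
print(score)
```

Planes with a high score would then be candidates for manual follow-up, which matters because, as noted above, some flagged planes turn out to be false positives.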

Using this model, BuzzFeed News identified several surveillance aircraft in action during a four-month period in late 2015. These included a spy plane operated by US Marshals to hunt drug cartels in Mexico; aircraft covertly registered to US Customs and Border Protection patrolling the US-Mexico border; and a US Navy contractor operating planes circling several points over land in the San Francisco Bay Area, ostensibly for harbor porpoise research.

You can learn more about the stories Buzzfeed News uncovered in the flight data here, and for details on the implementation of the AI model in R, follow the link below.

Github (Buzzfeed): BuzzFeed News Trained A Computer To Search For Hidden Spy Planes. This Is What We Found.

Posted by David Smith at 15:05 in applications, current events, data science, R | Permalink | Comments (2)

May 26, 2017

Who is the caretaker? Evidence-based probability estimation with the bnlearn package

by Juan M. Lavista Ferres, Senior Director of Data Science at Microsoft

In what was one of the most viral episodes of 2017, political science Professor Robert E Kelly was live on BBC World News talking about the South Korean president being forced out of office when both his kids decided to take an easy path to fame by showing up in their dad’s interview.

The video immediately went viral, and the BBC reported that within five days more than 100 million people from all over the world had watched it. Many people around the globe, on Facebook and Twitter, along with reporters from reliable sources like Time.com, thought the woman who went after the children was their nanny, when in fact the woman in the video was Robert’s wife, Jung-a Kim, who is Korean.

The confusion over this episode caused a second viral wave, calling out that people who thought she was the nanny should feel bad for stereotyping.

We decided to embrace the uncertainty and take a data science based approach to estimating the chances that the person was the nanny or the mother of the child, based on the evidence people had from watching the news.

@David_Waddell What would that mean, please? Re-broadcasting it on BBC TV, or just here on Twitter? Is this kinda thing that goes 'viral' and gets weird?
— Robert E Kelly (@Robert_E_Kelly) March 10, 2017

What evidence did viewers of the video have?

the person is American Caucasian
the person is professional
there are two kids
the caretaker is Asian

We then look for probability values for these statistics. (Given that Professor Kelly is American, all statistics are based on US data.)

Probability (Asian Wife | Caucasian Husband) = 1% [Married couples in the United States in 2010]
Probability of (Household has Nanny | husband is professional) = 3.5% [The Three Faces of Work-Family Conflict, page 9, Figure 3]
Probability of (Asian | Nanny) = 6% [Caregiver Statistics: Demographics]
Probability of (Stay at home mom) = 14% and Probability of (Stay at home mom | Asian Wife) = 30% [Stay-at-Home Mothers by Demographic Group ]

We define the following Bayesian network using the bnlearn package for R. We create the network using the model2network function and then we input the conditional probability tables (CPTs) that we know at each node.

library(bnlearn)
set.seed(3)
net <- model2network("[HusbandDemographics][HusbandIsProfessional][NannyDemographics][WifeDemographics|HusbandDemographics][StayAtHomeMom|HusbandIsProfessional:WifeDemographics][HouseholdHasNanny|StayAtHomeMom:HusbandIsProfessional][Caretaker|StayAtHomeMom:HouseholdHasNanny][CaretakerEthnicity|WifeDemographics:Caretaker:NannyDemographics]")

plot(net)

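The last step is to set the parameters of the Bayesian network conditional on its structure. Because we already have the probabilities from the sources above, we specify the conditional probability table (CPT) for each node by hand and attach them to the network with the custom.fit function, rather than learning them from data with bn.fit.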

yn <- c("yes", "no")
ca <- c("caucacian","other")
ao <- c("asian","other")
nw <- c("nanny","wife")

cptHusbandDemographics <- matrix(c(0.85, 0.15), ncol=2, dimnames=list(NULL, ca)) #[1]
cptHusbandIsProfessional <- matrix(c(0.81, 0.19), ncol=2, dimnames=list(NULL, yn)) #[2]
cptNannyDemographics <- matrix(c(0.06, 0.94), ncol=2, dimnames=list(NULL, ao)) # [3]
cptWifeDemographics <- matrix(c(0.01, 0.99, 0.33, 0.67), ncol=2, dimnames=list("WifeDemographics"=ao, "HusbandDemographics"=ca)) #[1]
cptStayAtHomeMom <- c(0.3, 0.7, 0.14, 0.86, 0.125, 0.875, 0.125, 0.875) #[4]

dim(cptStayAtHomeMom) <- c(2, 2, 2)
dimnames(cptStayAtHomeMom) <- list("StayAtHomeMom"=yn, "WifeDemographics"=ao, "HusbandIsProfessional"=yn)

cptHouseholdHasNanny <- c(0.01, 0.99, 0.035, 0.965, 0.00134, 0.99866, 0.00134, 0.99866) #[5]
dim(cptHouseholdHasNanny) <- c(2, 2, 2)
dimnames(cptHouseholdHasNanny) <- list("HouseholdHasNanny"=yn, "StayAtHomeMom"=yn, "HusbandIsProfessional"=yn)

cptCaretaker <- c(0.5, 0.5, 0.999, 0.001, 0.01, 0.99, 0.001, 0.999)
dim(cptCaretaker) <- c(2, 2, 2)
dimnames(cptCaretaker) <- list("Caretaker"=nw, "StayAtHomeMom"=yn, "HouseholdHasNanny"=yn)

cptCaretakerEthnicity <- c(0.99, 0.01, 0.99, 0.01, 0.99, 0.01, 0.01, 0.99, 0.01,0.99,0.99,0.01,0.01,0.99,0.01,0.99)
dim(cptCaretakerEthnicity) <- c(2, 2, 2,2)
dimnames(cptCaretakerEthnicity) <- list("CaretakerEthnicity"=ao,"Caretaker"=nw, "WifeDemographics"=ao ,"NannyDemographics"=ao)

net.disc <- custom.fit(net, dist=list(HusbandDemographics=cptHusbandDemographics, HusbandIsProfessional=cptHusbandIsProfessional, WifeDemographics=cptWifeDemographics, StayAtHomeMom=cptStayAtHomeMom, HouseholdHasNanny=cptHouseholdHasNanny, Caretaker=cptCaretaker, NannyDemographics=cptNannyDemographics,CaretakerEthnicity=cptCaretakerEthnicity))

Once we have the model, we can query the network using cpquery to estimate the probability of the events and calculate the probability that the person is the nanny or the wife based on the evidence we have (husband is Caucasian and professional, caretaker is Asian). Based on this evidence the output is that the probability that she is the wife is 90% vs. 10% that she is the nanny.

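# Both queries condition on the same evidence; cpquery estimates each probability by simulation
probWife <- cpquery(net.disc, (Caretaker=="wife"),HusbandDemographics=="caucacian" & HusbandIsProfessional=="yes" & CaretakerEthnicity=="asian",n=1000000)
probNanny <- cpquery(net.disc, (Caretaker=="nanny"),HusbandDemographics=="caucacian" & HusbandIsProfessional=="yes" & CaretakerEthnicity=="asian",n=1000000)
print(paste("The probability that the caretaker is his wife  =", probWife))
print(paste("The probability that the caretaker is the nanny =", probNanny))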

[1] "The probability that the caretaker is his wife  = 0.898718647764449"
[1] "The probability that the caretaker is the nanny = 0.110892031547457"
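Note that the two estimates do not sum exactly to 1. That is expected: cpquery approximates each probability by Monte Carlo sampling, and the two calls use separate random draws. As a cross-check, the sketch below computes the same posterior exactly by enumerating every combination of the unobserved nodes in base R. It reuses the CPT values from the code above, copying only the slices that apply to a Caucasian, professional husband; the variable names are ours and the array layouts are assumed to follow the dimnames given above.

```r
# Exact posterior by brute-force enumeration (base R only).
# Evidence: husband "caucacian" and professional, caretaker "asian".
yn <- c("yes", "no"); ao <- c("asian", "other"); nw <- c("nanny", "wife")

pWifeDemo  <- c(asian = 0.01, other = 0.99)  # P(WifeDemographics | husband caucacian)
pNannyDemo <- c(asian = 0.06, other = 0.94)  # P(NannyDemographics)
# P(StayAtHomeMom | WifeDemographics), husband professional
pStay <- array(c(0.3, 0.7, 0.14, 0.86), c(2, 2), list(yn, ao))
# P(HouseholdHasNanny | StayAtHomeMom), husband professional
pHasNanny <- array(c(0.01, 0.99, 0.035, 0.965), c(2, 2), list(yn, yn))
# P(Caretaker | StayAtHomeMom, HouseholdHasNanny)
pCare <- array(c(0.5, 0.5, 0.999, 0.001, 0.01, 0.99, 0.001, 0.999),
               c(2, 2, 2), list(nw, yn, yn))
# P(CaretakerEthnicity | Caretaker, WifeDemographics, NannyDemographics)
pEth <- array(c(0.99, 0.01, 0.99, 0.01, 0.99, 0.01, 0.01, 0.99,
                0.01, 0.99, 0.99, 0.01, 0.01, 0.99, 0.01, 0.99),
              c(2, 2, 2, 2), list(ao, nw, ao, ao))

# Every combination of the unobserved nodes
g <- expand.grid(wife = ao, nanny = ao, stay = yn, has = yn, care = nw,
                 stringsAsFactors = FALSE)

# Joint probability of each combination together with CaretakerEthnicity == "asian"
g$p <- pWifeDemo[g$wife] * pNannyDemo[g$nanny] *
  pStay[cbind(g$stay, g$wife)] *
  pHasNanny[cbind(g$has, g$stay)] *
  pCare[cbind(g$care, g$stay, g$has)] *
  pEth[cbind("asian", g$care, g$wife, g$nanny)]

# Normalise by the total probability of the evidence
posterior <- tapply(g$p, g$care, sum) / sum(g$p)
print(round(posterior, 3))
```

The exact values should land close to the simulated ones (roughly 0.89 for the wife and 0.11 for the nanny), and unlike the simulated estimates they sum to 1.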

In conclusion, if you thought the woman in the video was the nanny, you may need to review your priors!

The bnlearn package is available on CRAN. You can find the R code behind this post here on GitHub or here as a Jupyter Notebook.

Posted by Guest Blogger at 08:36 in current events, packages, R, statistics | Permalink | Comments (2)

March 24, 2017

Comparing subreddits, with Latent Semantic Analysis in R

FiveThirtyEight published a fascinating article this week about the subreddits that provided support to Donald Trump during his campaign, and continue to do so today. Reddit, for those not in the know, is a popular online social community organized into thousands of discussion topics, called subreddits (the names all begin with "r/"). Most of the subreddits are a useful forum for interesting discussions by like-minded people, and some of them are toxic. (That toxicity extends to some of the names, which is reflected in some of the screenshots below; apologies in advance.) The article looks at various popular and notorious subreddits and finds those that are most similar to the main subreddit devoted to Donald Trump and also to the main other contenders in the 2016 campaign for president, Hillary Clinton and Bernie Sanders.

The underlying method used to compare subreddits for this purpose is quite ingenious. It's based on a concept you might call "subreddit algebra": you can "add" two subreddits and find a third that reflects the intersection of the two. (One example they give is that adding r/nba to r/minnesota gives you r/timberwolves, the subreddit for Minnesota's NBA team.) Then you can apply the same process to subtraction: if you remove all the posts like those in the mainstream r/politics site from those in r/The_Donald, you're left with posts that look like those in several toxic subreddits.

The statistical technique used to identify posts that are "similar" to another is Latent Semantic Analysis, and the article gives this nice illustration of using it to compare subreddits:

The analysis was performed in R, and the code is available on GitHub. The code makes heavy use of the lsa package for R, which provides a number of functions for performing latent semantic analysis. The triangular plot shown above, known as a ternary diagram, was created using the ggtern package.
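The "subreddit algebra" idea can be sketched in a few lines of R. This is only an illustration, not FiveThirtyEight's code: it uses base R's svd function rather than the lsa package, and the count matrix is invented (rows are subreddits, columns are hypothetical groups of commenters).

```r
# Invented activity counts: rows = subreddits, columns = hypothetical commenter groups
counts <- rbind(
  nba          = c(9, 8, 0, 0, 0, 0),
  minnesota    = c(0, 0, 9, 8, 0, 0),
  timberwolves = c(5, 4, 5, 4, 0, 0),
  cooking      = c(0, 0, 0, 0, 7, 6)
)

# Latent semantic analysis step: keep the top k singular directions
s <- svd(counts)
k <- 3
emb <- s$u[, 1:k] %*% diag(s$d[1:k])  # one k-dimensional vector per subreddit
rownames(emb) <- rownames(counts)

# Cosine similarity between two vectors
cosine <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

# "Add" two subreddits, then rank the remaining ones by similarity to the sum
query <- emb["nba", ] + emb["minnesota", ]
sims <- apply(emb, 1, cosine, b = query)
candidates <- sims[!(names(sims) %in% c("nba", "minnesota"))]
best <- names(which.max(candidates))
print(best)
```

Subtraction works the same way: replace the sum with a difference of two vectors and rank the remaining subreddits against the result.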

For the complete subreddit analysis, and the list of subreddits close to Donald Trump based on the analysis, check out the FiveThirtyEight article linked below.

FiveThirtyEight: Dissecting Trump's Most Rabid Following

Posted by David Smith at 13:38 in current events, graphics, packages, R | Permalink | Comments (1)

January 10, 2017

The anatomy of a useful chart: NOAA's flood forecasts

With thanks to NOAA's incredible data gathering and forecasting activities, I've been obsessed with this chart for the past few days:

We used to live near the Napa river where this river gage is located, and still have many friends in the area. We were in the area last weekend, when a "pineapple express" weather event brought an atmospheric river over much of California, with much rain and some flooding in low-lying areas. This was just before the first peak in the chart above, which shows the water level in the Napa river (in blue) along with a NOAA forecast (in purple). I was checking this chart obsessively, as the observed water level approached the "Major Flood" level, and experienced alternate bouts of hope and fear as the forecast skirted above the line from time to time.

Relying on this chart so intently made me appreciate what it takes to make a useful chart, so let's look at the ways this particular chart stands out. (While NOAA does use R for some hydrological charts, I don't think R was used for this one.)

The chart is updated frequently, and the most recent data point is highlighted. New river levels were posted every 15 minutes, and as the crest was peaking, knowing how recent the data were was critical.

A forecast is provided. The purple dots are based on a hydrological forecast, which includes information from upstream gages, weather forecasts, and the river formation around this particular location. This was an incredibly useful tool during the flood threat. However, the forecast is only updated every few hours, so having the recency of the forecast on the chart was incredibly helpful.

Context is provided for the measurements and forecast. I hadn't really paid much mind to the river level before; most of the time it's not much more than a minor stream. But knowing what river levels represented minor, moderate or major flooding (with their detailed definitions) was important. (As you can see, the river just avoided the major flooding stage on Sunday, and indeed the local town stayed mostly dry. Some vineyards were flooded, though.)

Time zones are provided with times. There's nothing more frustrating than looking at a date or time, not knowing what time zone the data are provided in. This chart includes both the local time zone (PST) for the main axis and annotations and, on the top axis, Coordinated Universal Time. (17Z refers to 5PM Zulu time, which is 8 hours ahead of PST.)

The second Y axis. Having a second Y axis on a chart is rarely a good idea, but this is one of the examples where it's useful. The river flow is directly (but nonlinearly) related to river height, so presenting it here on the Y axis is useful for those that need it. (This is actually the value — not river height — used as input to the forecasts.) But while bridge engineers care about river flow, most are more concerned about the height, which is given top billing on the main Y axis. Bonus credit: units are provided for both axes: always a must-have, but lamentably often forgotten.

Annotations are provided for context. Having the recent and forecast peak heights and the record flood height included on the chart provided context for the severity of the current flood threat, especially if you had experienced prior flood events in the area.

The chart is in PNG format. That's a good choice for a chart like this: it's a lossless format, which means the data points appear in perfect fidelity. It's also a fairly compact format that keeps image sizes small — important for a website that may experience a lot of traffic from many people constantly refreshing the report. (Thoughtfully, an auto-refresh option was also provided.) JPG, a lossy format that blurs small data points and straight lines, would have been a terrible choice here.
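On the time-zone point above, the conversion between Zulu time and local time is easy to check in R. This is a small sketch using base R only; the timestamp is chosen purely for illustration and is not read from the chart.

```r
# A 17Z timestamp (17:00 UTC) on an illustrative January date
obs_utc <- as.POSIXct("2017-01-08 17:00", tz = "UTC")

# The same instant expressed in California local time
local <- format(obs_utc, "%Y-%m-%d %H:%M %Z", tz = "America/Los_Angeles")
print(local)
```

In January California is on standard time, so the result is eight hours behind UTC; in summer the same conversion would give a seven-hour offset and a PDT label.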

That's not to say this chart gets everything right. More resolution would have been helpful (especially when trying to compare the last data point to the prior one: is the river level rising or falling?). The color key for the flood stages is far from the chart on the webpage, rather than being included on the chart itself. The NOAA logo is a bit intrusive (though I understand why it's there). And in general the styling could use an update (pseudo-3D chrome is so last century). But this chart gets many more things right than it gets wrong, and provides a useful lesson in presenting data graphically that people can actually use.

As you can see from the chart, another flood-level crest is now heading down the river. As things stand now, it doesn't seem like it's going to be as severe as the one on Sunday, but to everyone in the affected areas: good luck, take care, and give thanks to NOAA for keeping us all informed.

Posted by David Smith at 10:26 in current events, graphics, R | Permalink | Comments (1)

December 02, 2016

Stylometry: Identifying authors of texts using R

Few people expect politicians to write every word they utter themselves; reliance on speechwriters and spokespersons is a long-established political practice. Still, it's interesting to know which statements are truly the politician's own words, and which are driven primarily by advisors or influencers.

Recently, David Robinson established a way of figuring out which tweets from Donald Trump's Twitter account came from him personally, as opposed to from campaign staff, which he verified by comparing the sentiment of tweets from Android vs iPhone devices. Now, Ali Arsalan Kazmi has used stylometric analysis to investigate the provenance of speeches by the Prime Minister of Pakistan.

By looking at the aspects of linguistic style (word/sentence length, frequency of word pairings, use of punctuation, etc.) of the speeches of the Prime Minister Nawaz Sharif, Ali found suggestions of at least 2 authors (and possibly more) behind the speeches. This is particularly apparent in this consensus network of appearances of 4-character sequences in speeches, which divides them into two clusters (of possibly differing authorship).

Ali used R and several packages to perform the analysis. These included the openNLP package to extract attributes from the speech data, the stylo package for stylometric analysis, the fpc package for the clustering, and the igraph package to visualize the clusters. The complete R script used for the analysis is available on GitHub.
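To illustrate the idea of clustering texts by 4-character sequences, here is a simplified stand-in written in base R. It is not Ali's script and does not use the stylo package or its consensus networks: the helper char_ngrams is our own, and the three short "speeches" are made-up snippets.

```r
# Split a text into overlapping character n-grams (4-grams by default)
char_ngrams <- function(text, n = 4) {
  text <- tolower(gsub("\\s+", " ", text))
  starts <- seq_len(max(0, nchar(text) - n + 1))
  substring(text, starts, starts + n - 1)
}

# Made-up snippets standing in for speeches (not from the original data)
docs <- c(
  a = "my fellow citizens, the nation will prosper and the nation will grow",
  b = "my fellow citizens, the nation will grow and the nation will prosper",
  c = "exports fell 1.3 percent; imports rose 2.8 percent; budget figures rose 4.2 percent"
)

grams <- lapply(docs, char_ngrams)
vocab <- sort(unique(unlist(grams)))

# Relative frequency of every 4-gram in every document (rows = documents)
freq <- t(vapply(
  grams,
  function(g) as.numeric(table(factor(g, levels = vocab))) / length(g),
  numeric(length(vocab))
))

# Hierarchical clustering on the frequency profiles, cut into two groups
clusters <- cutree(hclust(dist(freq)), k = 2)
print(clusters)
```

Texts that share a writing style share many of the same character sequences, so they end up in the same cluster; here snippets a and b group together and c stands apart.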

For an overview of the analysis, check out this slide presentation by Ali, and for the complete details take a look at the blog post linked below.

A Blog On Data Analytics: How many Authors does the Prime Minister have for his speeches: A Stylometric Analysis

Posted by David Smith at 11:49 in current events, data science, packages, R | Permalink | Comments (0)

November 24, 2016

Happy Thanksgiving!

It's Thanksgiving day here in the US, so we're taking the rest of the week off to reflect on what we're thankful for. And even if you're not in the US, today is a great day to send thanks to the R Core Group for providing their dedication, time, and expertise to make the R Project what it is today.

(Sadly, cowsay doesn't feature a Thanksgiving turkey.) We'll be back next week. For those that are celebrating, have a great holiday. See you on Monday!

Posted by David Smith at 09:00 in current events, R | Permalink | Comments (1)


Revolutions

Milestones in AI, Machine Learning, Data Science, and visualization with R and Python since 2008