In an effort to provide context to the frequent mass shootings in the United States, Kieran Healy (Associate Professor of Sociology at Duke University) created this updated chart comparing assault death rates in the US to those of 23 other advanced democracies. The chart shows the rate (per 100,000 citizens) of death caused by assaults (stabbings, gunshots, etc. by a third party). Assaults are used rather than gun deaths specifically, as that's the only statistic for which readily comparable data are available. The data come from the OECD Health Status Database through 2015, the most recent complete year available.
The goal of this chart is to "set the U.S. in some kind of longitudinal context with broadly comparable countries", and to that end OECD countries Estonia and Mexico are not included. (Estonia suffered a spike of violence in the mid-1990s, and Mexico has been embroiled in drug violence for decades. See the chart with Estonia and Mexico included here.) Healy provides a helpful FAQ explaining this decision and addressing other issues related to the data and their presentation.
Healy used the R language (specifically, the ggplot2 graphics package) to create this chart, and the source code is available on GitHub.
For more context around this chart, follow the link below, and see also his prior renderings and commentary related to the same data through 2013 and through 2010.
This iteration improves the Harvey version by displaying rainfall in a fine grid over the state rather than on a county-by-county basis. Once again you can find the R code and data on Github, and an interactive version of the chart is available at the link below.
Want to know what's capturing the attention of the producers at the 24-hour cable news stations? There's no equivalent of Twitter's trending topics for the likes of CNN or BBC News, but the newsflash package for R by Bob Rudis can extract the latest trending topics from the TV news stations.
The newsflash package is an interface to the GDELT Project's Television Explorer, which provides access to the closed-captioning transcripts from seven major cable-news stations, with archives available for the past 6 years. In particular, it provides access to the top trending "entities" (in the sense of the Stanford Named Entity Recognizer), ranked by the number of sentences in which they were mentioned during the last 24 hours. You can see R code extracting the rankings here.
The newsflash package is still in alpha-test mode and only available on Github (and not yet on CRAN). Also, it seems that the GDELT API can be a little unreliable and sometimes fails to return results. Nonetheless, it looks to be a useful resource for exploring what the TV news networks are reporting.
On August 26 Hurricane Harvey became the largest hurricane to make landfall in the United States in over 20 years. (That record may yet be broken by Irma, now bearing down on the Florida peninsula.) Harvey's rains brought major flooding to Houston and other coastal areas in the Gulf of Mexico. You can see the rainfall generated by Harvey across Texas and Louisiana in this animation from the US Geological Survey of county-by-county precipitation as the storm makes its way across land.
Interestingly, the heaviest rains appear to fall somewhat away from the eye of the storm, marked in orange on the map. The animation features Harvey's geographic track, along with a choropleth of hourly rainfall totals and a joyplot of river gage flow rates, and was created using the R language. You can find the data and R code on GitHub. The code makes use of the USGS's own vizlab package, which facilitates the rapid assembly of web-ready visualizations.
You can find more information about the animation, including the web-based interactive version, at the link below.
Last year, BuzzFeed News broke the story that US law enforcement agencies were using small aircraft to observe points of interest in US cities, thanks to analysis of public flight-records data. With the data journalism team no doubt realizing that the Flightradar24 data set hosted many more stories of public interest, the challenge lay in separating routine, day-to-day aircraft traffic from the more unusual, covert activities.
So they trained an artificial intelligence model to identify unusual flight paths in the data. The model, implemented in the R programming language, applies a random forest algorithm to identify flight patterns similar to those of covert aircraft identified in their earlier "Spies in the Skies" story. When that model was applied to the almost 20,000 flights in the Flightradar24 dataset, about 69 planes were flagged as possible surveillance aircraft. Several of those were false positives, but further journalistic inquiry into the provenance of the registrations led to several interesting stories.
Using this model, BuzzFeed News identified several surveillance aircraft in action during a four-month period in late 2015. These included a spy plane operated by US Marshals to hunt drug cartels in Mexico; aircraft covertly registered to US Customs and Border Protection patrolling the US-Mexico border; and a US Navy contractor operating planes circling several points over land in the San Francisco Bay Area, ostensibly for harbor porpoise research.
by Juan M. Lavista Ferres, Senior Director of Data Science at Microsoft
In what was one of the most viral episodes of 2017, political science Professor Robert E Kelly was live on BBC World News talking about the South Korean president being forced out of office when both his kids decided to take an easy path to fame by showing up in their dad's interview.
The video immediately went viral, and the BBC reported that within five days more than 100 million people from all over the world had watched it. Many people around the globe, on Facebook and Twitter and including reporters from reliable sources like Time.com, thought the woman who went after the children was their nanny, when in fact the woman in the video was Robert's wife, Jung-a Kim, who is Korean.
We decided to embrace the uncertainty and take a data science based approach to estimating the chances that the person was the nanny or the mother of the child, based on the evidence people had from watching the news.
@David_Waddell What would that mean, please? Re-broadcasting it on BBC TV, or just here on Twitter? Is this kinda thing that goes 'viral' and gets weird?
We define the following Bayesian network using the bnlearn package for R. We create the network using the model2network function and then we input the conditional probability tables (CPTs) that we know at each node.
library(bnlearn)
set.seed(3)
net <- model2network("[HusbandDemographics][HusbandIsProfessional][NannyDemographics][WifeDemographics|HusbandDemographics][StayAtHomeMom|HusbandIsProfessional:WifeDemographics][HouseholdHasNanny|StayAtHomeMom:HusbandIsProfessional][Caretaker|StayAtHomeMom:HouseholdHasNanny][CaretakerEthnicity|WifeDemographics:Caretaker:NannyDemographics]")
plot(net)
The last step is to fit the parameters of the Bayesian network conditional on its structure. The bn.fit function estimates the CPTs for all the nodes in the graph above (by maximum likelihood, unless another method is requested).
Once we have the model, we can query the network using cpquery to estimate the probability of the events and calculate the probability that the person is the nanny or the wife based on the evidence we have (husband is Caucasian and professional, caretaker is Asian). Based on this evidence, the estimated probability that she is the wife is about 90%, versus about 11% that she is the nanny. (The two estimates come from separate Monte Carlo sampling runs, so they need not sum to exactly 100%.)
probWife <- cpquery(net.disc, event = (Caretaker == "wife"),
  evidence = (HusbandDemographics == "caucacian" & HusbandIsProfessional == "yes" & CaretakerEthnicity == "asian"), n = 1000000)
probNanny <- cpquery(net.disc, event = (Caretaker == "nanny"),
  evidence = (HusbandDemographics == "caucacian" & HusbandIsProfessional == "yes" & CaretakerEthnicity == "asian"), n = 1000000)
[1] "The probability that the caretaker is his wife = 0.898718647764449"
[1] "The probability that the caretaker is the nanny = 0.110892031547457"
In conclusion, if you thought the woman in the video was the nanny, you may need to review your priors!
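To see how much the priors drive a result like this, the same kind of posterior can be worked out by hand on a much smaller model. The sketch below is not the network from this post: it keeps only two nodes (who the caretaker is, and the caretaker's ethnicity), and every probability in it is a hypothetical value chosen purely for illustration. It applies Bayes' rule exactly rather than by sampling, so the two posteriors sum to 1.

```r
# Toy two-node model: Caretaker -> CaretakerEthnicity.
# All probabilities below are hypothetical, NOT the CPTs used in the post.

# Prior: P(Caretaker)
prior <- c(wife = 0.70, nanny = 0.30)

# Likelihood: P(CaretakerEthnicity == "asian" | Caretaker)
lik_asian <- c(wife = 0.50, nanny = 0.20)

# Bayes' rule: the posterior is proportional to prior * likelihood
joint <- prior * lik_asian
posterior <- joint / sum(joint)

print(round(posterior, 3))
```

Changing the hypothetical prior or likelihood values and re-running shows how quickly the conclusion moves with the assumptions, which is the point of the exercise.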
FiveThirtyEight published a fascinating article this week about the subreddits that provided support to Donald Trump during his campaign, and continue to do so today. Reddit, for those not in the know, is a popular online social community organized into thousands of discussion topics, called subreddits (the names all begin with "r/"). Most of the subreddits are a useful forum for interesting discussions by like-minded people, and some of them are toxic. (That toxicity extends to some of the names, which is reflected in some of the screenshots below; apologies in advance.) The article looks at various popular and notorious subreddits and finds those that are most similar to the main subreddit devoted to Donald Trump and also to those of the other main contenders in the 2016 campaign for president, Hillary Clinton and Bernie Sanders.
The underlying method used to compare subreddits for this purpose is quite ingenious. It's based on a concept you might call "subreddit algebra": you can "add" two subreddits and find a third that reflects the intersection of the two. (One example they give is that adding r/nba to r/minnesota gives you r/timberwolves, the subreddit for Minnesota's NBA team.) Then you can apply the same process to subtraction: if you remove all the posts like those in the mainstream r/politics site from those in r/The_Donald, you're left with posts that look like those in several toxic subreddits.
The statistical technique used to identify posts that are "similar" to another is Latent Semantic Analysis, and the article gives this nice illustration of using it to compare subreddits:
The analysis was performed in R, and the code is available on GitHub. The code makes heavy use of the lsa package for R, which provides a number of functions for performing latent semantic analysis. The triangular plot shown above, known as a ternary diagram, was created using the ggtern package.
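The "subreddit algebra" idea can be illustrated with a few lines of base R. This is only a minimal sketch, not the FiveThirtyEight code: it assumes each subreddit has already been reduced to a numeric vector, and the three-dimensional vectors below are made up for illustration. Adding or subtracting vectors and then ranking the remaining subreddits by cosine similarity gives the "nearest" result.

```r
# Toy "subreddit algebra" with made-up 3-dimensional vectors.
# Rows are subreddits; columns are hypothetical latent dimensions.
# None of these numbers come from the FiveThirtyEight analysis.
vecs <- rbind(
  nba          = c(0.9, 0.0, 0.1),
  minnesota    = c(0.0, 0.9, 0.1),
  timberwolves = c(0.7, 0.7, 0.0),
  lakers       = c(0.8, 0.0, 0.3),
  politics     = c(0.0, 0.1, 0.9)
)

# Cosine similarity between two numeric vectors
cosine_sim <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

# Rank all subreddits (except the operands) by similarity to a query vector
nearest <- function(query, exclude = character(0)) {
  candidates <- setdiff(rownames(vecs), exclude)
  sims <- sapply(candidates, function(s) cosine_sim(query, vecs[s, ]))
  sort(sims, decreasing = TRUE)
}

# "Addition": r/nba + r/minnesota
added <- nearest(vecs["nba", ] + vecs["minnesota", ],
                 exclude = c("nba", "minnesota"))
print(round(added, 3))

# "Subtraction": r/timberwolves - r/minnesota
subtracted <- nearest(vecs["timberwolves", ] - vecs["minnesota", ],
                      exclude = c("timberwolves", "minnesota"))
print(round(subtracted, 3))
```

With these toy vectors, timberwolves ranks first for the sum and nba ranks first for the difference, mirroring the example from the article.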
For the complete subreddit analysis, and the list of subreddits close to Donald Trump based on the analysis, check out the FiveThirtyEight article linked below.
With thanks to NOAA's incredible data gathering and forecasting activities, I've been obsessed with this chart for the past few days:
We used to live near the Napa river where this river gage is located, and still have many friends in the area. We were in the area last weekend, when a "pineapple express" weather event brought an atmospheric river over much of California, with much rain and some flooding in low-lying areas. This was just before the first peak in the chart above, which shows the water level in the Napa river (in blue) along with a NOAA forecast (in purple). I was checking this chart obsessively, as the observed water level approached the "Major Flood" level, and experienced alternate bouts of hope and fear as the forecast skirted above the line from time to time.
Relying on this chart so intently made me appreciate what it takes to make a useful chart, so let's look at the ways this particular chart stands out. (While NOAA does use R for some hydrological charts, I don't think R was used for this one.)
The chart is updated frequently, and the most recent data point is highlighted. New river levels were posted every 15 minutes, and as the crest was peaking, knowing how recent the data were was critical.
A forecast is provided. The purple dots are based on a hydrological forecast, which includes information from upstream gages, weather forecasts, and the river formation around this particular location. This was an incredibly useful tool during the flood threat. However, the forecast is only updated every few hours, so having the recency of the forecast on the chart was incredibly helpful.
Context is provided for the measurements and forecast. I hadn't really paid much mind to the river level before: most of the time it's not much more than a minor stream. But knowing what river levels represented minor, moderate or major flooding (with their detailed definitions) was important. (As you can see, the river just avoided the major flooding stage on Sunday, and indeed the local town stayed mostly dry. Some vineyards were flooded, though.)
Time zones are provided with times. There's nothing more frustrating than looking at a date or time, not knowing what time zone the data are provided in. This chart includes both the local time zone (PST) for the main axis and annotations and, on the top axis, Coordinated Universal Time. (17Z refers to 5PM Zulu time, which is 8 hours ahead of PST.)
The second Y axis. Having a second Y axis on a chart is rarely a good idea, but this is one of the examples where it's useful. The river flow is directly (but nonlinearly) related to river height, so presenting it here on the Y axis is useful for those that need it. (This is actually the value, not river height, used as input to the forecasts.) But while bridge engineers care about river flow, most people are more concerned about the height, which is given top billing on the main Y axis. Bonus credit: units are provided for both axes, always a must-have but lamentably often forgotten.
Annotations are provided for context. Having the recent and forecast peak heights and the record flood height included on the chart provided context for the severity of the current flood threat, especially if you had experienced prior flood events in the area.
The chart is in PNG format. That's a good choice for a chart like this: it's a lossless format, which means the data points appear in perfect fidelity. It's also a fairly compact format that keeps image sizes small, which is important for a website that may experience a lot of traffic from many people constantly refreshing the report. (Thoughtfully, an auto-refresh option was also provided.) JPG, a lossy format that blurs small data points and straight lines, would have been a terrible choice here.
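The time-zone point above (17Z versus PST) is easy to reproduce when labelling your own charts. The sketch below uses base R only; the date is an arbitrary winter date picked for illustration (so that the Pacific zone is on standard time), not a value read from the NOAA chart.

```r
# Hypothetical timestamp: 17Z on an arbitrary winter date
# (winter, so the Pacific zone is on standard time, PST).
utc_time <- as.POSIXct("2017-01-10 17:00", tz = "UTC")

# Render the same instant in both zones
utc_label   <- format(utc_time, "%H:%M", tz = "UTC")
local_label <- format(utc_time, "%H:%M", tz = "America/Los_Angeles")

cat(utc_label, "UTC =", local_label, "Pacific\n")
```

Storing the timestamp once in UTC and formatting it per time zone at display time avoids the ambiguity the paragraph above complains about.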
That's not to say this chart gets everything right. More resolution would have been helpful (especially when trying to compare the last data point to the prior one: is the river level rising or falling?). The color key for the flood stages is far from the chart on the webpage, rather than being included on the chart itself. The NOAA logo is a bit intrusive (though I understand why it's there). And in general the styling could use an update (pseudo-3D chrome is so last century). But this chart gets many more things right than it gets wrong, and provides a useful lesson in presenting data graphically that people can actually use.
As you can see from the chart, another flood-level crest is now heading down the river. As things stand now, it doesn't seem like it's going to be as severe as the one on Sunday, but to everyone in the affected areas: good luck, take care, and give thanks to NOAA for keeping us all informed.
Few people expect politicians to write every word they utter themselves; reliance on speechwriters and spokespersons is a long-established political practice. Still, it's interesting to know which statements are truly the politician's own words, and which are driven primarily by advisors or influencers.
By looking at aspects of linguistic style (word/sentence length, frequency of word pairings, use of punctuation, etc.) in the speeches of Prime Minister Nawaz Sharif, Ali found suggestions of at least two authors (and possibly more) behind the speeches. This is particularly apparent in this consensus network of appearances of 4-character sequences in speeches, which divides them into two clusters (of possibly differing authorship).
Ali used R and several packages to perform the analysis. These included the openNLP package to extract attributes from the speech data, the stylo package for stylometric analysis, the fpc package for the clustering, and the igraph package to visualize the clusters. The complete R script used for the analysis is available on Github.
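To give a feel for the "4-character sequences" feature mentioned above, here is a minimal base-R sketch of character n-gram profiles. It is an illustration only, not Ali's script and not how the stylo package is implemented; the function names and the three sample sentences are hypothetical.

```r
# Extract overlapping character n-grams from a single string (base R only).
char_ngrams <- function(text, n = 4) {
  text <- tolower(text)
  if (nchar(text) < n) return(character(0))
  starts <- seq_len(nchar(text) - n + 1)
  substring(text, starts, starts + n - 1)
}

# Relative frequency of each n-gram in a text, as a named numeric vector
ngram_profile <- function(text, n = 4) {
  counts <- table(char_ngrams(text, n))
  setNames(as.numeric(counts) / sum(counts), names(counts))
}

# Cosine similarity between two profiles over their combined vocabulary
profile_similarity <- function(p, q) {
  vocab <- union(names(p), names(q))
  a <- p[vocab]; a[is.na(a)] <- 0
  b <- q[vocab]; b[is.na(b)] <- 0
  sum(a * b) / sqrt(sum(a^2) * sum(b^2))
}

# Hypothetical sample texts, invented for this example
speech_a <- "my dear countrymen, we shall build roads and motorways"
speech_b <- "my dear countrymen, we shall build schools and hospitals"
speech_c <- "the fiscal deficit narrowed as revenue collection improved"

sim_ab <- profile_similarity(ngram_profile(speech_a), ngram_profile(speech_b))
sim_ac <- profile_similarity(ngram_profile(speech_a), ngram_profile(speech_c))
print(c(a_vs_b = sim_ab, a_vs_c = sim_ac))
```

A full stylometric analysis would compute such profiles over many long texts and then cluster them; the pairwise similarity here only shows the basic ingredient.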
For an overview of the analysis, check out this slide presentation by Ali, and for the complete details take a look at the blog post linked below.
It's Thanksgiving day here in the US, so we're taking the rest of the week off to reflect on what we're thankful for. And even if you're not in the US, today is a great day to send thanks to the R Core Group for providing their dedication, time, and expertise to make the R Project what it is today.
(Sadly, cowsay doesn't feature a Thanksgiving turkey.) We'll be back next week. For those that are celebrating, have a great holiday. See you on Monday!