opensource.google.com

Showing posts with label science. Show all posts

arXiv LaTeX cleaner: safer and easier open source research papers

Friday, February 22, 2019

Open source is usually associated with the code behind utilities and applications, though you can find it in many other places, such as the LaTeX source code that describes the PDFs of scientific papers.

As an example, the following source code:


Generates this PDF when compiled using pdflatex:
You can see a huge repository of such open source code at arXiv.org, an open access repository of scientific papers currently containing about 1.5 million entries (140,616 uploads in 2018). One can not only download all papers in PDF format, but also obtain the source code to regenerate them and freely reuse any of their parts.

Open sourcing LaTeX code, however, comes with its risks and challenges. We’ve built and released the code of arXiv LaTeX cleaner to remedy some of these.

Scrubbing the Code

The main risk one faces when sharing LaTeX code with the world is accidentally releasing private information, primarily through commented code left over in the file itself.

While authors put a lot of effort into polishing the final PDF, the code isn’t usually cleaned up and is left with many pieces of text that don’t actually appear in the PDF. Things like, “I do not see why the following statement should be correct,” or “Look, I’m citing you!,” make it into arXiv for everyone to see. This happens so often there’s even a Twitter bot that finds and publishes them!

Cleaning up this commented out code manually is laborious, so arXiv LaTeX cleaner automatically removes it for you.
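As a rough illustration of what removing commented-out text involves, here is a minimal Python sketch. It is not the code of arXiv LaTeX cleaner: the function name `strip_latex_comments` is hypothetical, and the sketch ignores corner cases such as `verbatim` environments or a `%` that directly follows a `\\` line break.

```python
import re

# An unescaped percent sign and everything after it on the same line.
# Simplification: "\%" is always treated as a literal percent sign, so a
# comment that directly follows a "\\" line break would be missed.
_COMMENT = re.compile(r"(?<!\\)%.*$")


def strip_latex_comments(source: str) -> str:
    """Return `source` without LaTeX comments (hypothetical helper)."""
    cleaned = []
    for line in source.splitlines():
        if line.lstrip().startswith("%"):
            # The whole line is a comment: drop it entirely.
            continue
        # Trailing comment: keep a bare "%" so that LaTeX still ignores the
        # line break, as it did in the original file.
        cleaned.append(_COMMENT.sub("%", line))
    return "\n".join(cleaned)


example = (
    "Our result holds. % I do not see why this should be correct\n"
    "% TODO: remove\n"
    "50\\% of cases"
)
print(strip_latex_comments(example))
```

Keeping the bare `%` on lines with a trailing comment is a design choice of this sketch: it avoids introducing a space that the original source suppressed.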

Private information can also be found in the many auxiliary files that LaTeX generates when the code is compiled. Some of them are needed in arXiv (e.g., .bbl files), some of them are not: arXiv LaTeX cleaner will delete the unneeded ones and keep the rest automatically.
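The keep-or-delete decision can be pictured as a filter on file extensions. The sketch below is an assumption-laden illustration, not the tool's actual logic: the extension list in `DROP_EXTENSIONS` and the helper `split_files` are hypothetical, chosen only because `.aux`, `.log` and similar files are commonly produced by a LaTeX run.

```python
from pathlib import PurePosixPath

# Assumed, illustrative list of auxiliary files that are not needed for a
# submission. ".bbl" is deliberately absent because arXiv needs it.
DROP_EXTENSIONS = {".aux", ".log", ".out", ".blg", ".toc"}


def split_files(paths):
    """Split paths into (keep, drop) lists based on their extension."""
    keep, drop = [], []
    for path in paths:
        suffix = PurePosixPath(path).suffix.lower()
        if suffix in DROP_EXTENSIONS:
            drop.append(path)
        else:
            keep.append(path)
    return keep, drop


keep, drop = split_files(["paper.tex", "paper.bbl", "paper.aux", "paper.log"])
print("keep:", keep)
print("drop:", drop)
```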

Cleaning and Autoscaling Images

Challenges also come our way when preparing the code to submit to arXiv: one needs to upload a ZIP file smaller than 10 MBytes. With high resolution pictures and figures, it’s easy to go beyond the limit.

Manually resizing images and deleting images that aren’t actually in the final version is time consuming and cumbersome, so arXiv LaTeX cleaner does that automatically, too. If there’s a very intricate figure you’d like to keep in high resolution, you can specify a list of images and their expected resolution.

We hope that, by making open sourcing research papers faster and safer, arXiv LaTeX cleaner will help even more researchers embrace open access and make their work freely available.

arXiv LaTeX cleaner itself is open source, so you can adapt it to your needs. If you think your adaptation would be useful for others, we’d love your contributions, too.

By Jordi Pont-Tuset, Machine Perception team

Open Sourcing the Hunt for Exoplanets

Friday, March 9, 2018



Recently, we discovered two exoplanets by training a neural network to analyze data from NASA’s Kepler space telescope and accurately identify the most promising planet signals. And while this was only an initial analysis of ~700 stars, we consider this a successful proof-of-concept for using machine learning to discover exoplanets, and more generally another example of using machine learning to make meaningful gains in a variety of scientific disciplines (e.g. healthcare, quantum chemistry, and fusion research).

Today, we’re excited to release our code for processing the Kepler data, training our neural network model, and making predictions about new candidate signals. We hope this release will prove a useful starting point for developing similar models for other NASA missions, like K2 (Kepler’s second mission) and the upcoming Transiting Exoplanet Survey Satellite mission. As well as announcing the release of our code, we’d also like to take this opportunity to dig a bit deeper into how our model works.

A Planet Hunting Primer

First, let’s consider how data collected by the Kepler telescope is used to detect the presence of a planet. The plot below is called a light curve, and it shows the brightness of the star (as measured by Kepler’s photometer) over time. When a planet passes in front of the star, it temporarily blocks some of the light, which causes the measured brightness to decrease and then increase again shortly thereafter, causing a “U-shaped” dip in the light curve.
A light curve from the Kepler space telescope with a “U-shaped” dip that indicates a transiting exoplanet.
However, other astronomical and instrumental phenomena can also cause the measured brightness of a star to decrease, including binary star systems, starspots, cosmic ray hits on Kepler’s photometer, and instrumental noise.
The first light curve has a “V-shaped” pattern that tells us that a very large object (i.e. another star) passed in front of the star that Kepler was observing. The second light curve contains two places where the brightness decreases, which indicates a binary system with one bright and one dim star: the larger dip is caused by the dimmer star passing in front of the brighter star, and vice versa. The third light curve is one example of the many other non-planet signals where the measured brightness of a star appears to decrease.
To search for planets in Kepler data, scientists use automated software (e.g. the Kepler data processing pipeline) to detect signals that might be caused by planets, and then manually follow up to decide whether each signal is a planet or a false positive. To avoid being overwhelmed with more signals than they can manage, the scientists apply a cutoff to the automated detections: those with signal-to-noise ratios above a fixed threshold are deemed worthy of follow-up analysis, while all detections below the threshold are discarded. Even with this cutoff, the number of detections is still formidable: to date, over 30,000 detected Kepler signals have been manually examined, and about 2,500 of those have been validated as actual planets!

Perhaps you’re wondering: does the signal-to-noise cutoff cause some real planet signals to be missed? The answer is, yes! However, if astronomers need to manually follow up on every detection, it’s not really worthwhile to lower the threshold, because as the threshold decreases the rate of false positive detections increases rapidly and actual planet detections become increasingly rare. However, there’s a tantalizing incentive: it’s possible that some potentially habitable planets like Earth, which are relatively small and orbit around relatively dim stars, might be hiding just below the traditional detection threshold — there might be hidden gems still undiscovered in the Kepler data!

A Machine Learning Approach

The Google Brain team applies machine learning to a diverse variety of data, from human genomes to sketches to formal mathematical logic. Considering the massive amount of data collected by the Kepler telescope, we wondered what we might find if we used machine learning to analyze some of the previously unexplored Kepler data. To find out, we teamed up with Andrew Vanderburg at UT Austin and developed a neural network to help search the low signal-to-noise detections for planets.
We trained a convolutional neural network (CNN) to predict the probability that a given Kepler signal is caused by a planet. We chose a CNN because they have been very successful in other problems with spatial and/or temporal structure, like audio generation and image classification.
Luckily, we had 30,000 Kepler signals that had already been manually examined and classified by humans. We used a subset of around 15,000 of these signals, of which around 3,500 were verified planets or strong planet candidates, to train our neural network to distinguish planets from false positives. The inputs to our network are two separate views of the same light curve: a wide view that allows the model to examine signals elsewhere on the light curve (e.g., a secondary signal caused by a binary star), and a zoomed-in view that enables the model to closely examine the shape of the detected signal (e.g., to distinguish “U-shaped” signals from “V-shaped” signals).
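To make the two views concrete, here is a minimal pure-Python sketch of one way to derive them: fold the light curve on the signal's period so that every transit lines up at phase zero, then bin the folded points once over the whole orbit (wide view) and once over a narrow window around the dip (zoomed-in view). The function names, the bin counts, the window width and the use of a median per bin are all assumptions made for illustration; they are not taken from the released code.

```python
from statistics import median


def phase_fold(times, period, t0):
    """Map each time to a phase in [-period/2, period/2), with transits at 0."""
    half = period / 2.0
    return [((t - t0 + half) % period) - half for t in times]


def binned_view(phases, flux, lo, hi, num_bins):
    """Median flux in `num_bins` equal-width phase bins covering [lo, hi)."""
    width = (hi - lo) / num_bins
    bins = [[] for _ in range(num_bins)]
    for p, f in zip(phases, flux):
        if lo <= p < hi:
            index = min(int((p - lo) / width), num_bins - 1)
            bins[index].append(f)
    # Empty bins are reported as None rather than guessed.
    return [median(b) if b else None for b in bins]


# Period, t0 and duration (all in days) are the Kepler-90 i values quoted
# later in this post; the flux samples are made up for the example.
period, t0, duration = 14.44912, 2.2, 0.11267
times = [t0 + k * period + offset for k in range(3) for offset in (-3.0, 0.0, 3.0)]
flux = [1.0, 0.999, 1.0] * 3

phases = phase_fold(times, period, t0)
wide_view = binned_view(phases, flux, -period / 2, period / 2, num_bins=5)
zoomed_view = binned_view(phases, flux, -2 * duration, 2 * duration, num_bins=5)
print(wide_view)
print(zoomed_view)
```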

Once we had trained our model, we investigated the features it learned about light curves to see if they matched our expectations. One technique we used (originally suggested in this paper) was to systematically occlude small regions of the input light curves to see whether the model’s output changed. Regions that are particularly important to the model’s decision will change the output prediction if they are occluded, but occluding unimportant regions will not have a significant effect. Below is a light curve from a binary star that our model correctly predicts is not a planet. The points highlighted in green are the points that most change the model’s output prediction when occluded, and they correspond exactly to the secondary “dip” indicative of a binary system. When those points are occluded, the model’s output prediction changes from ~0% probability of being a planet to ~40% probability of being a planet. So, those points are part of the reason the model rejects this light curve, but the model uses other evidence as well. For example, zooming in on the centered primary dip shows that it is actually “V-shaped”, which is also indicative of a binary system.
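The occlusion technique can be sketched in a few lines of Python. Everything here is illustrative: `toy_model` is a hypothetical stand-in for the trained network (it simply lowers its score when it sees a dip away from the centre), and the window size and fill value are arbitrary assumptions.

```python
def occlusion_sensitivity(model, light_curve, window, fill_value):
    """Absolute change in model output when each window is masked in turn."""
    baseline = model(light_curve)
    scores = []
    for start in range(len(light_curve) - window + 1):
        occluded = list(light_curve)
        occluded[start:start + window] = [fill_value] * window
        scores.append(abs(model(occluded) - baseline))
    return scores


def toy_model(curve):
    """Hypothetical classifier: penalises any dip away from the central one."""
    centre = len(curve) // 2
    secondary_depth = max(
        1.0 - flux for i, flux in enumerate(curve) if abs(i - centre) > 1
    )
    return max(0.0, 0.9 - 10 * secondary_depth)


# Primary dip at the centre (index 5), secondary dip at index 2.
curve = [1.0, 1.0, 0.95, 1.0, 1.0, 0.9, 1.0, 1.0, 1.0, 1.0, 1.0]
scores = occlusion_sensitivity(toy_model, curve, window=1, fill_value=1.0)
most_important = scores.index(max(scores))
print(most_important)
```

In this toy setup the most influential point is the secondary dip, which mirrors the behaviour described above for the binary star example.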

Searching for New Planets

Once we were confident with our model’s predictions, we tested its effectiveness by searching for new planets in a small set of 670 stars. We chose these stars because they were already known to have multiple orbiting planets, and we believed that some of these stars might host additional planets that had not yet been detected. Importantly, we allowed our search to include signals that were below the signal-to-noise threshold that astronomers had previously considered. As expected, our neural network rejected most of these signals as spurious detections, but a handful of promising candidates rose to the top, including our two newly discovered planets: Kepler-90 i and Kepler-80 g.

Find your own Planet(s)!

Let’s take a look at how the code released today can help (re-)discover the planet Kepler-90 i. The first step is to train a model by following the instructions on the code’s home page. It takes a while to download and process the data from the Kepler telescope, but once that’s done, it’s relatively fast to train a model and make predictions about new signals. One way to find new signals to show the model is to use an algorithm called Box Least Squares (BLS), which searches for periodic “box shaped” dips in brightness (see below). The BLS algorithm will detect “U-shaped” planet signals, “V-shaped” binary star signals and many other types of false positive signals to show the model. There are various freely available software implementations of the BLS algorithm, including VARTOOLS and LcTools. Alternatively, you can even look for candidate planet transits by eye, like the Planet Hunters.
A low signal-to-noise detection in the light curve of the Kepler-90 star, found by the BLS algorithm. The detection has a period of 14.44912 days and a duration of 2.70408 hours (0.11267 days), beginning 2.2 days after 12:00 on 1/1/2009 (the year the Kepler telescope launched).
To run this detected signal through our trained model, we simply execute the following command:
python predict.py --kepler_id=11442793 --period=14.44912 --t0=2.2 \
  --duration=0.11267 --kepler_data_dir=$HOME/astronet/kepler \
  --output_image_file=$HOME/astronet/kepler-90i.png \
  --model_dir=$HOME/astronet/model
The output of the command is prediction = 0.94, which means the model is 94% certain that this signal is a real planet. Of course, this is only a small step in the overall process of discovering and validating an exoplanet: the model’s prediction is not proof one way or the other. The process of validating this signal as a real exoplanet requires significant follow-up work by an expert astronomer — see Sections 6.3 and 6.4 of our paper for the full details. In this particular case, our follow-up analysis validated this signal as a bona fide exoplanet, and it’s now called Kepler-90 i!
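One detail of the command is easy to miss: the figure caption gives the transit duration in hours, while the value passed to `--duration` is in days. The conversion is a single division, shown here as a small Python check (the helper name `hours_to_days` is just for illustration).

```python
HOURS_PER_DAY = 24.0


def hours_to_days(hours):
    """Convert a duration in hours to days."""
    return hours / HOURS_PER_DAY


# Duration of the detected signal, as quoted in the figure caption.
duration_days = hours_to_days(2.70408)
print(f"{duration_days:.5f}")  # 0.11267
```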
Our work here is far from done. We’ve only searched 670 stars out of 200,000 observed by Kepler — who knows what we might find when we turn our technique to the entire dataset. Before we do that, though, we have a few improvements we want to make to our model. As we discussed in our paper, our model is not yet as good at rejecting binary stars and instrumental false positives as some more mature computer heuristics. We’re hard at work improving our model, and now that it’s open sourced, we hope others will do the same!


By Chris Shallue, Senior Software Engineer, Google Brain Team

If you’d like to learn more, Chris is featured on the latest episode of This Week In Machine Learning & AI discussing his work.

Open source visualization of GPS displacements for earthquake cycle physics

Thursday, November 10, 2016

The Earth’s surface is moving, ever so slightly, all the time. This slow, small, but persistent movement of the Earth's crust is responsible for the formation of mountain ranges, sudden earthquakes, and even the positions of the continents. Scientists around the world measure these almost imperceptible movements using arrays of Global Navigation Satellite System (GNSS) receivers to better understand all phases of an earthquake cycle—both how the surface responds after an earthquake, and the storage of strain energy between earthquakes.

To help researchers explore this data and better understand the earthquake cycle, we are releasing a new, interactive data visualization which draws geodetic velocity lines on top of a relief map by amplifying position estimates relative to their true positions. Unlike existing approaches, which focus on small time slices or individual stations, our visualization can show all the data for a whole array of stations at once. Open sourced under an Apache 2 license, and available on GitHub, this visualization technique is a collaboration between Harvard’s Department of Earth and Planetary Sciences and Google's Machine Perception and Big Picture teams.

Our approach helps scientists quickly assess deformations across all phases of the earthquake cycle—both during earthquakes (coseismic) and the time between (interseismic). For example, we can see azimuth (direction) reversals of stations as they relate to topographic structures and active faults. Digging into these movements will help scientists vet their models and their data, both of which are crucial for developing accurate computer representations that may help predict future earthquakes.

Classical approaches to visualizing these data have fallen into two general categories: 1) a map view of velocity/displacement vectors over a fixed time interval and 2) time versus position plots of each GNSS component (longitude, latitude and altitude).

Examples of classical approaches. On the left is a map view showing average velocity vectors over the period from 1997 to 2001[1]. On the right you can see a time versus eastward (longitudinal) position plot for a single station.

Each of these approaches has proved to be an informative way to understand the spatial distribution of crustal movements and the time evolution of solid earth deformation. However, because geodetic shifts happen in almost imperceptible distances (mm) and over long timescales, both approaches can only show a small subset of the data at any time: a condensed average velocity per station, or a detailed view of a single station, respectively. Our visualization enables a scientist to see all the data at once, then interactively drill down to a specific subset of interest.

Our visualization approach is straightforward; by magnifying the daily longitude and latitude position changes, we show tracks of the evolution of the position of each station. These magnified position tracks are shown as trails on top of a shaded relief topography to provide a sense of position evolution in geographic context.

To see how it works in practice, let’s step through an example. Consider this tiny set of longitude/latitude pairs for a single GNSS station, in which only the last few digits differ from day to day:


Day Index    Longitude       Latitude
0            139.06990407    34.949757897
1            139.06990400    34.949757882
2            139.06990413    34.949757941
3            139.06990409    34.949757921
4            139.06990413    34.949757904

If we were to draw line segments between these points directly on a map, they’d be much too small to see at any reasonable scale. So we take these minute differences and multiply them by a user-controlled scaling factor. By default this factor is 10^5.5 (about 316,000x).
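The magnification step can be written out with the five sample positions from the table above. This is a minimal sketch, not the visualization's own code (which runs in a WebGL shader): the function name `magnify` is hypothetical, and measuring offsets from the first day's position is an assumption made for the example.

```python
# Default factor described in the text: ten raised to the power 5.5,
# roughly 316,000.
SCALE = 10 ** 5.5

# (longitude, latitude) pairs for days 0 to 4, from the table above.
positions = [
    (139.06990407, 34.949757897),
    (139.06990400, 34.949757882),
    (139.06990413, 34.949757941),
    (139.06990409, 34.949757921),
    (139.06990413, 34.949757904),
]


def magnify(points, scale=SCALE):
    """Amplify each point's offset from the first point (assumed reference)."""
    lon0, lat0 = points[0]
    return [
        (lon0 + (lon - lon0) * scale, lat0 + (lat - lat0) * scale)
        for lon, lat in points
    ]


track = magnify(positions)
for day, (lon, lat) in enumerate(track):
    print(day, f"{lon:.5f}", f"{lat:.5f}")
```

After scaling, day-to-day offsets of a few hundred-millionths of a degree become offsets of a few hundredths of a degree, which is large enough to draw as a visible trail.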


To help the user identify which end is the start of the line, we give the start and end points different colors and interpolate between them. Blue and red are the default colors, but they’re user-configurable. Although day-to-day movement of stations may seem erratic, by using this method, one can make out a general trend in the relative motion of a station.
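The start-to-end coloring amounts to a linear interpolation between two RGB values along the track. A minimal sketch follows, with hypothetical helper names and 8-bit RGB tuples assumed for the default blue and red.

```python
BLUE = (0, 0, 255)  # assumed RGB value for the default start color
RED = (255, 0, 0)   # assumed RGB value for the default end color


def lerp_color(start, end, t):
    """Linearly interpolate between two RGB colors for t in [0, 1]."""
    return tuple(round(s + (e - s) * t) for s, e in zip(start, end))


def segment_colors(num_segments, start=BLUE, end=RED):
    """One color per line segment, running from `start` to `end`."""
    if num_segments == 1:
        return [start]
    return [
        lerp_color(start, end, i / (num_segments - 1))
        for i in range(num_segments)
    ]


colors = segment_colors(4)
print(colors[0], colors[-1])
```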
Close-up of a single station’s movement during the three year period from 2003 to 2006.
However, static renderings of this sort suffer from the same problem that velocity vector images do; in regions with a high density of GNSS stations, tracks overlap significantly with one another, obscuring details. To solve this problem, our visualization lets the user interactively control the time range of interest, the amount of amplification and other settings. In addition, by animating the lines from start to finish, the user gets a real sense of motion that’s difficult to achieve in a static image.

We’ve applied our new visualization to the ~20 years of data from the GEONET array in Japan. Through it, we can see small but coherent changes in direction before and after the great 2011 Tohoku earthquake.
GPS data sets (in .json format) for both the GEONET data in Japan and the Plate Boundary Observatory (PBO) data in the western US are available at earthquake.rc.fas.harvard.edu.
This short animation shows many of the visualization’s interactive features. In order:
  1. Modifying the multiplier adjusts how significantly the movements are magnified.
  2. We can adjust the time slider nubs to select a particular time range of interest.
  3. Using the map controls provided by the Google Maps JavaScript API, we can zoom into a tiny region of the map.
  4. By enabling map markers, we can see information about individual GNSS stations.
By focusing on a station of interest, we can even see curvature changes in the time periods before and after the event.
Station designated 960601 of Japan’s GEONET array is located on the island of Mikura-jima. Here we see the period from 2006 to 2012, with movement magnified 10^5.1 times (126,000x).
To achieve fast rendering of the line segments, we created a custom overlay using THREE.js to render the lines in WebGL. Data for the GNSS stations is passed to the GPU in a data texture, which allows our vertex shader to position each point on-screen dynamically based on user settings and animation.

We’re excited to continue this productive collaboration between Harvard and Google as we explore opportunities for groundbreaking, new earthquake visualizations. If you’d like to try out the visualization yourself, follow the instructions at earthquake.rc.fas.harvard.edu. It will walk you through the setup steps, including how to download the available data sets. If you’d like to report issues, great! Please submit them through the GitHub project page.

Acknowledgments

We wish to thank Bill Freeman, a researcher on Machine Perception, who hatched the idea and developed the initial prototypes, and Fernanda Viégas and Martin Wattenberg of the Big Picture Team for their visualization design guidance.

References

[1] Loveless, J. P., and Meade, B. J. (2010). Geodetic imaging of plate motions, slip rates, and partitioning of deformation in Japan, Journal of Geophysical Research.

By Jimbo Wilson, Software Engineer, Big Picture Team and Brendan Meade, Professor, Harvard Department of Earth and Planetary Sciences

Google Summer of Code 2016 wrap-up: HUES Platform

Wednesday, October 12, 2016

Every year Google Summer of Code pairs university students with mentors to hone their skills while working on open source projects, and every year we like to post wrap-ups from the open source projects about their experience and what students accomplished. Stay tuned for more!



The Holistic Urban Energy Simulation (HUES) platform is an open source platform for facilitating the design and control of renewables-based distributed energy systems. The platform is an initiative of the Urban Energy Systems Laboratory at Empa in Switzerland, in collaboration with our research partners at ETH-Zurich, EPFL, the University of Geneva and the Lucerne University of Applied Sciences. As we push towards the second version of the HUES platform, we had help from three bright and enthusiastic students as part of the Google Summer of Code (GSoC).

Project 1: Real-time wind flow in cities
Air flow pattern around a building configuration (left); link to Rhinoceros/Grasshopper (middle & right)
People in cities are suffering more and more from scorching heat caused by global warming and poor urban planning. Heat trapped inside cities has led to soaring air conditioning demand, which makes cities even hotter: a vicious circle! Clever bioclimatic urban design can mitigate urban heat by facilitating the use of natural ventilation and guiding air streams. However, the simulation of wind flow is a computationally and technically demanding task. There is a need to provide urban planners and architects with a tool able to predict wind flow patterns in real time to facilitate development of energy-efficient and passive designs.

Lukas Bystricky, a student at Florida State University, developed a Fast Fluid Dynamics (FFD) library in C# exactly for this purpose. Lukas’s implementation is based on the paper by Jos Stam (1999). In contrast to the original implementation, where a cell-centered finite difference is used to discretize the equations, Lukas applies a staggered-grid finite difference, which is the standard finite difference in Computational Fluid Dynamics (CFD). This is done to prevent spurious pressure oscillations near the boundary, which can occur with cell-centered finite differences for the Navier-Stokes equations. This does not change much in the algorithm or solvers, but makes enforcing the boundary conditions significantly more complicated. So far, Lukas uses a simple Jacobi solver as the linear solver, as was the case in Stam's original implementation, but he plans to replace it with more efficient solvers in the future. He is also validating his library against typical benchmarks.
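For readers unfamiliar with it, the Jacobi method solves a linear system by repeatedly recomputing each unknown from the previous estimates of all the others. The sketch below is a generic Python illustration on a tiny dense system, not an excerpt from the C# library; the function name, iteration limit and tolerance are assumptions, and convergence is only guaranteed for suitable matrices, such as the strictly diagonally dominant one used here.

```python
def jacobi(a, b, iterations=100, tolerance=1e-10):
    """Approximately solve a x = b with Jacobi iteration (dense lists)."""
    n = len(b)
    x = [0.0] * n
    for _ in range(iterations):
        # Every new component uses only values from the previous sweep.
        x_new = [
            (b[i] - sum(a[i][j] * x[j] for j in range(n) if j != i)) / a[i][i]
            for i in range(n)
        ]
        if max(abs(x_new[i] - x[i]) for i in range(n)) < tolerance:
            return x_new
        x = x_new
    return x


solution = jacobi([[4.0, 1.0], [2.0, 3.0]], [1.0, 2.0])
print(solution)
```

Because each sweep depends only on the previous one, the updates within a sweep are independent of each other, which makes the method simple to implement and to parallelize, at the cost of slower convergence than more sophisticated solvers.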

We are now coupling Lukas’s library into our HUES platform, more specifically into the 3D CAD software Rhinoceros and its visual programming platform Grasshopper. The final goal is to have an intuitive real-time visual design tool of wind flow for urban planners and architects. We will also couple it to whole-year dynamic building energy simulation programs, to better capture microclimatic effects of the urban context when simulating the building energy consumption of designs.

Project 2: Modular energy hub modeling framework
A connection between two bus objects in a CopyHub container
Distributed energy system components are modular in nature and interact across multiple scales. As such, there is a need for a modeling framework that can easily construct and configure systems of modular entities (energy demands, sources, converters, storages and network links) across scales. Frederik Banis, a student at the University of Applied Sciences Stuttgart, developed a modular approach to modeling distributed multi-energy systems (energy hubs) in Python, based on the Open Energy System Modelling Framework (Oemof) and Pyomo.

In the developed framework, energy system components are specified in a common format allowing for easy duplication and reconfiguring at larger scales. The platform enables easy manipulation of an energy hub grouping multiple components (demand, sources: electricity, natural gas; systems: photovoltaic panels, wind turbines, gas boilers, combined heat and power engines, etc.), as well as copying it (from hub1 to hub2) to create a larger interlinked system (district) where multiple energy hubs are connected. This hierarchical nested structure can be repeated as needed, and detailed results about the energy supply of each technology or energy stream can be analyzed in the form of different plots for each system or sub-system.

Project 3: Open source energy simulation database

The HUES platform includes a growing array of datasets describing the technical and economic characteristics of distributed energy technologies.  Currently, this data is stored in separate modules using different data structures and file formats, making it difficult to explore holistically and query systematically. To address this, GSoC student Khushboo Mandlecha has developed an open source database to enable the linked exploration, querying and visualization of data in the platform. 

The first part of the project involved the development of server based scripts to automatically extract relevant data from the modules of the existing HUES platform, and write this data to a common database. A standard format for technology component data was developed, enabling users to upload technology data files to be stored in the new database.  The new database has been developed in MongoDB, enabling fast data retrieval and allowing everything to be retrieved in the form of JSON objects. The second part of the project involved the development of a web-based portal for querying, visualizing and downloading data. Once this portal is complete, it will be possible to visualize the contents of the database in different ways, enabling users to get a sense of the distribution of property values and facilitating the identification of outliers.  Ultimately, the database will help researchers and practitioners using the HUES platform to develop models and perform comprehensive analyses of distributed energy systems.

By L. Andrew Bollinger, Julien Marquant and Christoph Waibel; Urban Energy Systems Laboratory, Empa, Switzerland

Opening up Science Journal

Friday, August 19, 2016

Science Journal is an app that turns your Android phone into a mobile science tool, allowing you to use the sensors in your phone to explore the world around you. The Making & Science team launched Science Journal a few months ago at Bay Area Maker Faire 2016 and have been excited to see different projects people have done with it all over the world!

Today we are happy to announce that we are releasing Science Journal 1.1 on the Google Play Store and also publishing the core source for the app. Open source software and hardware has been hugely beneficial to the science education ecosystem. By open sourcing, we’ll be able to improve the app faster and also to provide the community with an example of a modern Android app built with Material Design principles.

One important feature in Science Journal is the ability to connect to external devices over Bluetooth LE. We have open source firmware which runs on several Arduino microcontrollers already. In the near future, we will provide alternate ways to get your sensor data into Science Journal: stay tuned (or follow along with our commits)!

We believe that anyone can be a scientist anywhere. Science doesn’t just happen in the classroom or lab. Tools like Science Journal let you see how the world works with just your phone and now you can explore how Science Journal itself works, too. Give it a try and let us know what you think!

By Justin Koh, Software Engineer

CERN Summer Thrills

Tuesday, December 18, 2012



For the physics software development group at CERN, our second year of Google Summer of Code couldn’t have come at a better time. Motivated by CernVM's awesome experience in 2011, our colleagues from the Geant4 and ROOT software projects joined us as mentors this summer. And while physicists around the world snatched the first evidence of a long-sought Higgs boson from the Large Hadron Collider (LHC), our seven Google Summer of Code students worked on core parts of the open source software engine that makes LHC data processing possible.

Two of our students worked with the Geant4 team at CERN. Geant4 is a toolkit for the simulation of the response of a material when high-energetic particles are passing through it. Geant4 can be used to model a gas detector, a gamma-ray telescope, an electronic device next to an accelerator or the inside of a satellite. In order to keep up with the rate of real data coming from the LHC detectors, Geant4 has to be both accurate and fast.

  • Stathis Kamperis improved the speed of Geant4 by re-ordering the simulation of particles according to particle type. By simulating, for instance, first all electrons, then all photons, and so on, the number of instruction cache misses decreases. In the course of this work, Stathis also ported Geant4 to Solaris, which gives us access to the very powerful DTrace profiling machinery.
  • Dhruva Tirumala Bukkapatnam contemplated Geant4 pointers and data structures. He developed code for a particle navigation algorithm optimized for use on GPU architectures.
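The re-ordering idea from the first item above can be illustrated with a few lines of Python. This is only a sketch of the grouping logic: Geant4 itself is C++, the `(particle_type, energy)` tuples and the function name are hypothetical, and the cache benefit described above would not show up in an interpreted example like this one.

```python
from itertools import groupby


def batches_by_type(tracks):
    """Group (particle_type, energy) tuples so each type is handled together."""
    # sorted() is stable, so tracks of one type keep their original order.
    ordered = sorted(tracks, key=lambda track: track[0])
    return [
        (particle_type, [energy for _, energy in group])
        for particle_type, group in groupby(ordered, key=lambda track: track[0])
    ]


tracks = [("electron", 1.0), ("photon", 2.0), ("electron", 3.0)]
print(batches_by_type(tracks))
```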

Two more students worked with the ROOT team. The ROOT framework is the main workhorse for LHC experiments to store, analyze, and visualize their data.

  • Omar Zapata Mesa worked on an MPI interface for ROOT. On a cluster of machines, the MPI interface enables ROOT to toss around its C++ objects from node to node while also integrating with ROOT's C++ interpreter.
  • Eamon Ford worked on the CERN iOS App. The app brings CERN news and information to an iPad or iPhone. In case you can’t sleep at night, you can now peek at the live display of particle collisions from inside the LHC.
For the CernVM base technology, we had three more students working with us this summer. CernVM provides a virtual appliance used to develop and run LHC data processing applications on the distributed and heterogeneous computing infrastructure that is provided by hundreds of physics institutes and research labs around the world.

  • Josip Lisec, back for his second Google Summer of Code, worked on the log analysis and visualization of CernVM Co-Pilot, the job distribution system which powers the LHC@home Test4Theory volunteer computing project. Want to see the world map of active volunteers from the 19th of November at 3:07pm?  Check out the Co-Pilot dashboard.
  • Francesco Ruvolo worked on broken network services, such as misconfigured DNS or HTTP servers. Breaking such services in a controlled way comes in handy when simulating the behavior of a CernVM running on a hotel WiFi.
  • Rachee Singh programmed maintenance tools for the content distribution network that is used by the CernVM File System to distribute terabytes of experiment software to all the worker nodes. All the proxy servers of the content distribution network can now be plotted on a map and every CernVM can automatically find a close proxy by means of a Voronoi diagram produced by Rachee's code.
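Finding the closest proxy is equivalent to asking which Voronoi cell a client falls into, since each cell contains exactly the points nearest to one site. The sketch below answers that question directly with a great-circle distance, without constructing the diagram; it is not Rachee's code, and the proxy names and coordinates are made up for the example.

```python
import math

EARTH_RADIUS_KM = 6371.0


def great_circle_km(a, b):
    """Haversine distance between two (latitude, longitude) points in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (
        math.sin((lat2 - lat1) / 2) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def nearest_proxy(client, proxies):
    """Name of the proxy whose Voronoi cell contains `client`."""
    return min(proxies, key=lambda name: great_circle_km(client, proxies[name]))


# Hypothetical proxy locations as (latitude, longitude).
proxies = {
    "proxy-a": (46.2, 6.1),
    "proxy-b": (41.9, -87.6),
    "proxy-c": (35.7, 139.7),
}
print(nearest_proxy((48.9, 2.4), proxies))
```

A precomputed Voronoi diagram pays off when there are many lookups against a fixed set of proxies; for a handful of sites, the direct minimum shown here gives the same answer.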

Overall, we were very glad to see so much interest and enthusiasm from the student programmers in LHC software tools. We'd like to congratulate all of our students on their hard work and on successfully completing the program!

By Jacob Blomer, CERN Organization Administrator

2's Company, 3+ is a Crowd

Thursday, December 6, 2012



The Crowdsourcing Biology team at the Scripps Research Institute participated in the Google Summer of Code for the first time this year.  Five students contributed to efforts to harness the power of community intelligence to advance biomedical science.

Maximilian Ludvigsson took the first steps in the creation of Semantic BioGPS.  BioGPS is a user-extensible Web portal that provides easy access to information about genes from hundreds of different websites.  Maximilian produced a tool that allows BioGPS users to annotate regions of gene-centric Web pages to state, computationally, what different areas of the page ‘mean’.  These semantic annotations enable scripts to extract structured content about genes from these Web pages, paving the way for a new version of BioGPS that provides integrated views across multiple data sources.

Karthik G developed an interactive network visualization for the data linking genes to diseases in the GeneWiki+.  The GeneWiki+ is a Semantic MediaWiki (SMW) installation that dynamically integrates data about human genes from Wikipedia and from SNPedia.  While SMW queries provide a great way for programmers and advanced wiki users to interact with data, the graphical network that Karthik created gives ordinary biologists a new, intuitive, and sometimes beautiful way to explore connections between genes and disease.

Clarence Leung began the development of a new version of the crowdsourcing game Dizeez.  In this new two-player game, players are challenged to get their partner to guess a particular disease by prompting them with related genes.  This game follows in the tradition of ‘games with a purpose’ such as Foldit and the ESP game by producing novel, validated gene-disease associations as a result of game play.

Shivansh Srivastava worked on migrating BioGPS’s gene report layout windowing system from ExtJS to both a jQuery windowing environment and a Yahoo User Interface-based approach.  This view in BioGPS provides biologists with a customizable environment for accessing gene-centric data from a diverse collection of sources.  Shivansh’s efforts provided BioGPS developers with insight into the technical limitations of each solution, as compared to the current BioGPS ExtJS codebase.

Kevin Wu developed a scalable and efficient system for storing and analyzing biologically meaningful sets of genes.  Accessible via a RESTful HTTP interface, the system uses MongoDB for storage and custom code for distributed computing that executes statistical comparisons across thousands of gene sets in parallel.  For any particular gene set, Kevin’s code makes it possible to rapidly identify similar gene sets and to calculate the ‘enrichment’ (a statistical measure of overlap) of that gene set with respect to any other.  This work will soon be integrated into BioGPS to allow users to save their own gene sets and to query for similar gene sets from others.
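
To make ‘enrichment’ concrete, here is a minimal sketch of one common way to score the overlap between two gene sets: a one-sided hypergeometric test.  The post does not say which statistic Kevin's code uses, so the choice of test, the function name, and the gene symbols below are all assumptions for illustration.

```python
from math import comb


def enrichment_p_value(set_a, set_b, background_size):
    """Probability of seeing at least the observed overlap between set_a and
    set_b if set_a were drawn at random from `background_size` genes
    (one-sided hypergeometric test; an assumed choice of statistic)."""
    overlap = len(set_a & set_b)
    n, k = len(set_a), len(set_b)
    total = comb(background_size, n)
    # Sum the upper tail: overlaps of size `overlap` up to the largest possible.
    tail = sum(
        comb(k, i) * comb(background_size - k, n - i)
        for i in range(overlap, min(n, k) + 1)
    )
    return tail / total


if __name__ == "__main__":
    # Hypothetical gene sets drawn from a background of 10 genes.
    a = {"TP53", "BRCA1", "EGFR"}
    b = {"BRCA1", "EGFR", "MYC"}
    print(enrichment_p_value(a, b, background_size=10))
```

A small p-value means the two sets share more genes than chance would explain.  Ranking stored gene sets by this value is one way to find "similar" sets for a query.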

Thanks to all of our excellent students for their great contributions and to Google for sponsoring this unique program.  We are looking forward to participating in future editions of Google Summer of Code for many years to come!

By Benjamin Good and the Crowdsourcing Biology Google Summer of Code mentors

Scientists Camp Out* At Google

Wednesday, August 19, 2009

Last month, around 250 invited scientists and science-related journalists, artists, and educators flocked to Google for the fourth annual Science Foo Camp (SciFoo). SciFoo is an unconference jointly presented by Google, O'Reilly Media, and Nature Publishing Group.

In true Google tradition, the conference began with dinner, followed by an orientation session led by conference hosts Tim O'Reilly, Sara Winge, Chris DiBona, and Timo Hannay. Immediately after, the schedule for the weekend was created by attendees covering large boards with giant post-it notes of topics.


Tim O'Reilly, Sara Winge, Chris DiBona and Timo Hannay open the conference
(Photo Credit: Suhky Dhaliwal and Ellen Ko)



Filling in the schedule boards
(Photo Credit: Bertalan Meskó)

Attendees came from all branches of science and technology, and included luminaries such as Marvin Minsky, Louise Leakey, Peter Diamandis, Bill Nye, and George Smoot. But the conference isn't only for the famous. There were many physicists, biologists, psychiatrists, chemists, and almost every other -ist represented. Experimental poet Christian Bok and puzzle maker Pavel Curtis provided interesting views on many topics.

The sessions were as varied as the attendees. Topics included artificial intelligence, the challenges of science education, cartoon physics, space travel, climate change, swine flu, data sharing, microbes, and more. That list doesn't even begin to scratch the surface, or include the conversations that happen when a rocket scientist and a computational biologist sit down at the same table for lunch.


Jam session at the campground
(Photo Credit: Suhky Dhaliwal and Ellen Ko)

The event wasn't all discussion. Google demoed a Street View tricycle and a holodeck, tools for collecting and displaying geodata. The holodeck even provided an opportunity to visit all of Earth and Mars.


Google Earth as seen in the holodeck
(Photo Credit: Suhky Dhaliwal and Ellen Ko)

You couldn't put this many scientists in one place without doing some real science. Dr. Larry Weiss brought supplies for performing MRSA screening. Googler volunteers discovered that you could get almost anyone to stick a giant q-tip up their nose in the name of science. Lapsed Googler Simon Quellen Field and Theodore Gray, co-founder of Wolfram Research, created ice cream with only milk, sugar, liquid nitrogen, and power tools.



Simon Quellen Field and Theodore Gray make liquid nitrogen ice cream


Baris Baser, SciFoo volunteer, describes SciFoo as "hands down one of my favorite events at Google. I really enjoy how it brings volunteers together from different offices and departments. The spontaneity makes it unpredictable and unique."

For more information on what went on at SciFoo '09 visit Nature's aggregator or Google Blog Search.

*and no, there was no actual camping at SciFoo Camp ;)

SciFoo: 200 of the World’s Top Scientists Meet at Google’s Annual Meeting of Really, Really Smart People

Friday, August 15, 2008



Organized in collaboration with Nature Publishing Group and O’Reilly Media (“FOO” stands for “Friends of O’Reilly”), and hosted at the Googleplex, the third annual Science Foo Camp (SciFoo) unconference boasts no predefined agenda. Rather, participants are invited to propose their session topics on a giant white board, in various time slots with eight sessions running concurrently.

Most academic conferences are highly specialized and attended time and again by the same people. Here, to promote fruitful cross-pollination, participants hail from dozens of science and technology disciplines, from biology and astrophysics to CS and nano-technology. Attendance is by invitation only; in the interest of mixing things up, many of the 200+ participants are not invited twice. “SciFoo allows people at different institutions and from different disciplines to interact with each other,” says Open Source Programs Manager Chris DiBona, who spearheads SciFoo. “It gives them a rare chance to talk freely with each other in a private setting.”

This year, the conference was attended by Eric Schmidt, Sergey Brin, Larry Page, and Larry Brilliant of Google.org, along with a bevy of Google organizers and volunteers. The list of "campers" boasted four Nobel Prize winners (Sydney Brenner, Walter Gilbert, Andy Fire, and Frank Wilczek) and a laundry list of champions in the scientific community. Here are just a few: George Dyson (scientific historian), Brian Cox (physics popularizer, spokesman at CERN), Aubrey de Grey (biomedical gerontologist who studies "living young longer"), Eugenie Scott (director, National Center for Science Education), Brother Guy Consolmagno SJ (astronomer at the Vatican), Neal Stephenson (science fiction writer), Nick Bostrom (transhumanist philosopher), Dan Tani (NASA astronaut, who has spent 131 days in space), Stewart Brand (creator of The Whole Earth Catalog), Jill Bolte Taylor (neuroanatomist, author of the recent bestseller My Stroke of Insight [see TED talk]), notable theoretical physicists Lord Martin Rees (England's Astronomer Royal), Max Tegmark, Paul Davies, and Lee Smolin, and renowned oceanographer Sylvia Earle. To give the conference some oomph, rocket scientist Carl Dietrich brought along a model of his Terrafugia "roadable aircraft," also known as a flying car, and Ian Wright parked his X1 all-electric performance car, capable of 0-60 MPH in 3.07 seconds, by the dining tent.

Certain themes recurred. One was the need to do a better job of open sourcing data within the science community, including negative results; such sharing would enable collaboration and prevent scientists from "reinventing the wheel." A number of seminars also addressed the more quotidian concerns of studying science, from navigating office politics in academia to finding ways of making the discipline more exciting to young people. Many talks were also informed by specific social and humanitarian concerns, such as how Google can help detect emerging global pandemics, how genomic testing can help people prevent diseases, and, in a nutshell, what we can all do to ensure the long-term survival of the human race.

“A scientifically literate world is one that’s good for everyone,” DiBona says, summarizing the intent behind the conference. “People who are better educated will better understand what’s possible on the Internet. As Googlers, I think it’s incumbent on us to try to support basic science research and education around the world.”

You can learn more about SciFoo by checking out the blog buzz and news coverage aggregated at Nature.com.