Solar activity directly impacts Earth, from GPS accuracy to power systems.
MackSun was designed to process billions of high-frequency solar data points under strict hardware constraints, without relying on HPC infrastructure.
The platform is available at:
https://www.macksun.org
The problem
Instruments such as POEMAS (https://www.macksun.org/pages/wiki/arquivos-telescopios.html) operate with acquisition intervals of around 10 milliseconds. This cadence enables detailed analysis of solar activity, but it also produces a continuous, high-volume stream of data.
This creates a set of concrete challenges:
- continuous ingestion under load
- long-term storage of billions of records
- memory and I/O limitations
- processing under constant pressure
In most scenarios, this would require distributed systems or HPC clusters. Here, the system had to work without that.
Data origin
The data used in MackSun is not synthetic.
It comes from real solar observation instruments located in South America, operated at the CASLEO observatory in Argentina.
These instruments are managed by CRAAM, part of Mackenzie Presbyterian University in Brazil.
This matters because:
- data is generated under real observational conditions
- acquisition is continuous and subject to physical constraints
- system behavior is influenced by real hardware
This is not a controlled environment. It is a live acquisition scenario.
Infrastructure limits
The system runs under a constrained but well defined setup:
- single Linux server
- 16 vCPU
- 32 GB of RAM in total
- 4 GB reserved for the operating system
- 16 GB allocated to MongoDB running in sharded mode
- 12 GB allocated to the ingestion pipeline container
The MongoDB allocation is not arbitrary. It was defined based on limits observed during experimental validation.
Even on a single machine, MongoDB showed better performance in sharded mode. This is not assumed. It was experimentally validated and later published in Astronomy and Computing:
https://www.sciencedirect.com/science/article/pii/S221313372500126X
These limits are enforced. The system is designed to operate within them.
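The memory budget above can be stated as a simple invariant. This sketch only restates the figures from this section; the check itself is illustrative, not part of the actual deployment.

```python
# Memory budget from the article: every gigabyte of the 32 GB host is
# explicitly accounted for, so no component can grow unchecked.
TOTAL_RAM_GB = 32

budget_gb = {
    "operating_system": 4,      # reserved for the OS
    "mongodb_sharded": 16,      # MongoDB running in sharded mode
    "ingestion_pipeline": 12,   # ingestion pipeline container
}

# The allocations must cover the whole machine, with nothing left floating.
assert sum(budget_gb.values()) == TOTAL_RAM_GB
print(sum(budget_gb.values()))  # 32
```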
Data scale
The system currently handles:
- around 3 billion stored data points
- continuous ingestion from solar instruments
- original data acquired at high frequency
At this scale, uncontrolled growth leads to instability.
The system must control:
- memory usage
- write patterns
- data organization
- query behavior
Partitioning strategy
The system enforces a strict limit:
about 150 million data points per collection
Beyond this:
- performance degrades
- queries slow down
- memory pressure increases
Data is therefore split across multiple collections.
This is required for stability.
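One way to picture the split: at roughly 5 million points per observing day, the ~150 million cap means a collection holds about 30 days of data, so routing by calendar month is a natural fit. The naming scheme below is an assumption for illustration; the article does not specify the actual routing rule.

```python
# Sketch of date-based collection routing under the ~150 M point cap.
from datetime import date

POINTS_PER_DAY = 5_000_000
MAX_POINTS_PER_COLLECTION = 150_000_000

# Days of data one collection can absorb before hitting the limit:
days_per_collection = MAX_POINTS_PER_COLLECTION // POINTS_PER_DAY  # 30

def collection_for(obs_date: date, instrument: str) -> str:
    """Route an observation to a month-sized collection (hypothetical scheme)."""
    return f"{instrument}_{obs_date:%Y_%m}"

print(days_per_collection)                          # 30
print(collection_for(date(2024, 3, 15), "poemas"))  # poemas_2024_03
```

Routing by date also keeps queries local: a request for one day touches exactly one bounded collection.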
Ingestion model
The ingestion process does not run in real time.
It runs as a sequential pipeline with five stages, executed once per day.
This approach:
- avoids continuous load pressure
- keeps resource usage predictable
- simplifies failure handling
We chose batch processing over real-time ingestion. This gives up low latency, but guarantees stability.
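The daily run can be sketched as a strictly sequential chain. The five stage names below are placeholders; the article states the stage count but not what each stage does.

```python
# Minimal sketch of a five-stage sequential daily pipeline: each stage
# runs to completion before the next starts, so resource usage stays
# predictable and a failure is isolated to one stage.
from typing import Callable, List

Stage = Callable[[dict], dict]

def run_pipeline(batch: dict, stages: List[Stage]) -> dict:
    """Run stages one after another; each sees the previous stage's output."""
    for stage in stages:
        batch = stage(batch)
    return batch

# Hypothetical stages for one day's batch:
stages: List[Stage] = [
    lambda b: {**b, "fetched": True},      # 1. fetch raw instrument files
    lambda b: {**b, "validated": True},    # 2. validate records
    lambda b: {**b, "transformed": True},  # 3. transform / calibrate
    lambda b: {**b, "loaded": True},       # 4. load into storage
    lambda b: {**b, "published": True},    # 5. publish the daily dataset
]

result = run_pipeline({"day": "2024-03-15"}, stages)
print(result["published"])  # True
```

Because the pipeline runs once per day, a failed stage can simply be retried for that day's batch without affecting serving.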
Precomputed datasets
On-demand processing is not viable under these constraints.
One day of observation generates around 5 million data points.
Processing this during a request would:
- increase latency
- consume too much memory
- destabilize the system
The system generates daily datasets in advance.
Each dataset is:
- processed
- consolidated
- stored in a ready-to-serve format
Datasets are available at:
https://www.macksun.org
Structure and format are documented here:
https://www.macksun.org/pages/wiki/arquivos-telescopios.html
We chose precomputed datasets instead of on-demand processing. This reduces flexibility, but ensures consistent performance.
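A toy sketch of the precomputation step: raw high-frequency samples are consolidated ahead of time so a request only reads a ready-made result. The field layout and the per-minute aggregation are illustrative assumptions, not MackSun's actual format.

```python
# Sketch of daily precomputation: consolidate raw (timestamp, value)
# samples into per-minute averages, stored before any request arrives.
from collections import defaultdict
from statistics import mean

def consolidate(samples):
    """Group (timestamp_s, value) samples into per-minute averages."""
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[ts // 60].append(value)   # bucket index = minute of day
    return {minute: mean(vals) for minute, vals in sorted(buckets.items())}

raw = [(0, 1.0), (30, 3.0), (60, 5.0)]    # two minutes of toy data
daily = consolidate(raw)
print(daily)   # {0: 2.0, 1: 5.0}
```

The expensive pass over millions of points happens once, at build time; serving is then a cheap read of the consolidated output.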
Trade-offs
This architecture makes explicit decisions.
- Real time vs. stability: no real-time processing, but predictable execution.
- Flexibility vs. predictability: no arbitrary queries over raw data; structured access through prepared datasets.
- Infrastructure vs. engineering: no hardware scaling, but more control over data and processing.
We chose sharding on a single server. This is not the typical approach, but it was experimentally validated.
We chose precomputation instead of real-time processing. This reduces flexibility, but guarantees stability.
Why this works
The system works because it enforces limits.
- collections are bounded
- memory usage is controlled
- ingestion and access are separated
- heavy processing is done in advance
Instead of relying on infrastructure scaling, the system relies on controlled behavior.
Final thoughts
MackSun shows that it is possible to process billions of records without HPC, but only if constraints are treated as part of the design.
This requires:
- strict partitioning
- controlled ingestion
- precomputed outputs
- disciplined resource usage
Explore the datasets at https://www.macksun.org and see how MackSun handles billions of records under constrained hardware.