Enhance aggregator functionality for Nostr event collection
- Updated the aggregator to support both public (npub) and private (nsec) key inputs for event searching, enabling authentication for relays that require it.
- Implemented bloom filter loading and appending capabilities for efficient incremental data collection.
- Added timeout parameters for maximum runtime and stuck progress detection to improve reliability.
- Enhanced README with detailed usage instructions, authentication behavior, and examples for incremental collection.
- Bumped version to v0.17.16.
@@ -5,45 +5,129 @@ A comprehensive program that searches for all events related to a specific npub
## Usage
```bash
go run main.go -key <nsec|npub> [-since <timestamp>] [-until <timestamp>] [-filter <file>] [-output <file>]
```
Where:
- `<nsec|npub>` is either a bech32-encoded Nostr private key (nsec1...) or public key (npub1...)
- `<timestamp>` is a Unix timestamp (seconds since epoch) - optional
- `<file>` is a file path for bloom filter input/output - optional
### Parameters
- **`-key`**: Required. The bech32-encoded Nostr key to search for events
  - **nsec**: Private key (enables authentication to relays that require it)
  - **npub**: Public key (authentication disabled)
- **`-since`**: Optional. Start timestamp (Unix seconds). Only events after this time are collected
- **`-until`**: Optional. End timestamp (Unix seconds). Only events before this time are collected
- **`-filter`**: Optional. Path to a bloom filter saved by a previous run; events already recorded in it are skipped
- **`-output`**: Optional. File to write events to (defaults to stdout)
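For example, a run that resumes a previous collection might look like this (the key and file names are placeholders):
```bash
go run main.go -key npub1... \
  -since 1640995200 -until 1643673600 \
  -filter previous_filter.txt \
  -output events.jsonl 2>new_filter.txt
```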
### Incremental Collection
- **Bloom filter persistence**: Save deduplication state between runs for efficient incremental collection
- **Automatic append mode**: When loading an existing bloom filter, output is automatically appended to the output file
- **Timestamp tracking**: Records actual time range of processed events in bloom filter output
- **Seamless continuation**: Resume collection from where previous run left off without duplicates
### Reliability & Performance
- Connects to multiple relays simultaneously with dynamic expansion
- Outputs events in JSONL format (one JSON object per line)
- Handles connection failures gracefully
- Continues running until all relay connections are closed
- Time-based filtering with Unix timestamps (since/until parameters)
- Input validation for timestamp ranges
- Rate limiting and backoff for relay connection management (see the backoff sketch below)
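The exact backoff policy isn't spelled out in this README, but a typical shape for it looks like the following Go sketch (`connectRelay` is a hypothetical stand-in for the aggregator's real dial logic, not its actual API):

```go
package aggregator

import (
	"context"
	"fmt"
	"math/rand"
	"time"
)

// connectRelay is a hypothetical stand-in for the real dial logic
// (websocket connect, optional NIP-42 auth when a nsec is given).
func connectRelay(ctx context.Context, url string) error {
	return fmt.Errorf("not implemented")
}

// dialWithBackoff retries a relay connection with exponential backoff
// plus jitter, giving up after maxAttempts.
func dialWithBackoff(ctx context.Context, url string, maxAttempts int) error {
	delay := time.Second
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err := connectRelay(ctx, url); err == nil {
			return nil
		}
		// Wait for the current delay plus up to 50% random jitter,
		// bailing out early if the context is cancelled.
		jitter := time.Duration(rand.Int63n(int64(delay) / 2))
		select {
		case <-time.After(delay + jitter):
		case <-ctx.Done():
			return ctx.Err()
		}
		if delay < 30*time.Second {
			delay *= 2 // exponential growth, capped at ~30s
		}
	}
	return fmt.Errorf("relay %s: gave up after %d attempts", url, maxAttempts)
}
```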
## Event Discovery
@@ -70,6 +154,61 @@ The aggregator uses an intelligent progressive backward fetching strategy:
4. **Efficient processing**: Processes each time batch completely before moving to the next
5. **Boundary respect**: Stops when reaching the `since` timestamp or the beginning of available data (see the sketch below)
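A minimal sketch of that backward-walking loop, assuming a `fetchBatch` helper (illustrative only, not the aggregator's actual API) that returns up to `limit` events older than `until`, newest first:

```go
package aggregator

// Event is a minimal stand-in for a Nostr event (illustrative only).
type Event struct {
	ID        string
	CreatedAt int64
}

// fetchBatch is a hypothetical relay query returning up to `limit`
// events with created_at < until, newest first.
func fetchBatch(until, since int64, limit int) []Event {
	return nil // the real aggregator queries its relay pool here
}

// collectBackward walks from `until` back toward `since`, one time
// batch at a time, using the oldest event seen as the next upper bound.
func collectBackward(since, until int64, emit func(Event)) {
	cursor := until
	for cursor > since {
		batch := fetchBatch(cursor, since, 500)
		if len(batch) == 0 {
			return // beginning of available data reached
		}
		for _, ev := range batch {
			emit(ev) // real code: dedup via bloom filter, write JSONL
		}
		// The oldest event's timestamp becomes the next window's bound.
		cursor = batch[len(batch)-1].CreatedAt
	}
}
```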
## Incremental Collection Workflow
The aggregator supports efficient incremental data collection using persistent bloom filters. This allows you to build comprehensive event archives over time without re-processing duplicate events.
### How It Works
1. **First Run**: Creates a new bloom filter and collects events for the specified time range
2. **Bloom Filter Output**: At completion, outputs bloom filter summary to stderr with:
- Time range covered (actual timestamps of collected events)
- Base64-encoded bloom filter data for reuse
3. **Subsequent Runs**: Load the saved bloom filter to skip already-seen events
4. **Automatic Append**: When using an existing filter, new events are appended to the output file
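The deduplication and persistence core of this workflow is straightforward to sketch. The example below uses the `github.com/bits-and-blooms/bloom/v3` package as an assumed stand-in; the aggregator's actual filter implementation isn't shown in this README:

```go
package aggregator

import (
	"bytes"
	"encoding/base64"

	"github.com/bits-and-blooms/bloom/v3"
)

// newFilter sizes a filter for ~1M event IDs at a 0.1% false positive
// rate, consistent with the size and hash counts in the summary below.
func newFilter() *bloom.BloomFilter {
	return bloom.NewWithEstimates(1_000_000, 0.001)
}

// seen reports whether eventID was already recorded, adding it as a
// side effect (test-then-add in one call).
func seen(f *bloom.BloomFilter, eventID string) bool {
	return f.TestAndAdd([]byte(eventID))
}

// saveBase64 serializes the filter the way the summary block does.
func saveBase64(f *bloom.BloomFilter) (string, error) {
	var buf bytes.Buffer
	if _, err := f.WriteTo(&buf); err != nil {
		return "", err
	}
	return base64.StdEncoding.EncodeToString(buf.Bytes()), nil
}

// loadBase64 restores a filter saved by saveBase64.
func loadBase64(b64 string) (*bloom.BloomFilter, error) {
	raw, err := base64.StdEncoding.DecodeString(b64)
	if err != nil {
		return nil, err
	}
	var f bloom.BloomFilter
	if _, err := f.ReadFrom(bytes.NewReader(raw)); err != nil {
		return nil, err
	}
	return &f, nil
}
```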
### Bloom Filter Output Format
The bloom filter output includes comprehensive metadata:
```
=== BLOOM FILTER SUMMARY ===
Events processed: 1247
Estimated unique events: 1247
Bloom filter size: 1.75 MB
False positive rate: ~0.1%
Hash functions: 10
Time range covered: 1640995200 to 1672531200
Time range (human): 2022-01-01T00:00:00Z to 2023-01-01T00:00:00Z
Bloom filter (base64):
[base64-encoded binary data]
=== END BLOOM FILTER ===
```
### Best Practices
- **Save bloom filters**: Always redirect stderr to a file to preserve the bloom filter
- **Sequential time ranges**: Use non-overlapping time ranges for optimal efficiency
- **Regular updates**: Update your bloom filter file after each run for the latest state
- **Backup filters**: Keep copies of bloom filter files for different time periods
### Example Workflow
```bash
# Month 1: January 2022 (using npub for public relays)
go run main.go -key npub1... -since 1640995200 -until 1643673600 -output all_events.jsonl 2>filter_jan.txt
# Month 2: February 2022 (using nsec for auth-required relays, append to same file)
go run main.go -key nsec1... -since 1643673600 -until 1646092800 -filter filter_jan.txt -output all_events.jsonl 2>filter_feb.txt
# Month 3: March 2022 (continue with authentication for complete coverage)
go run main.go -key nsec1... -since 1646092800 -until 1648771200 -filter filter_feb.txt -output all_events.jsonl 2>filter_mar.txt
# Result: all_events.jsonl contains deduplicated events from all three months, including private relay content
```
## Memory Management
The aggregator uses advanced memory management techniques to handle large-scale data collection:
@@ -108,6 +247,8 @@ The program starts with the following initial relays:
## Output Format
### Event Output (stdout or -output file)
Each line of output is a JSON object representing a Nostr event with the following fields:
- `id`: Event ID (hex)
@@ -117,3 +258,32 @@ Each line of output is a JSON object representing a Nostr event with the followi
- `tags`: Array of tag arrays
- `content`: Event content string
- `sig`: Event signature (hex)
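For illustration, a single output line with the standard NIP-01 field set might look like this (all values are shortened placeholders; real ids and signatures are full-length hex strings):

```json
{"id":"a1b2...","pubkey":"c3d4...","created_at":1672531200,"kind":1,"tags":[["p","e5f6..."]],"content":"hello nostr","sig":"0718..."}
```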
### Bloom Filter Output (stderr)
At program completion, a comprehensive bloom filter summary is written to stderr containing:
- **Statistics**: Events processed, estimated unique events, filter size, false positive rate, and hash function count
- **Time Range**: Actual timestamps of the earliest and latest collected events, in both Unix and human-readable form
- **Binary Data**: Base64-encoded bloom filter for reuse in subsequent runs
The bloom filter output is structured with clear markers (`=== BLOOM FILTER SUMMARY ===` and `=== END BLOOM FILTER ===`), making it easy to parse and extract the base64 data programmatically.
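Because the markers are stable, the base64 payload can be pulled out of a captured stderr file with standard tools; for example (a sketch assuming the payload occupies the lines between `Bloom filter (base64):` and the end marker):

```bash
# Print the range between the base64 header and the end marker,
# then drop the marker lines themselves.
sed -n '/^Bloom filter (base64):$/,/^=== END BLOOM FILTER ===$/p' filter_jan.txt \
  | sed '1d;$d' > filter_jan.b64
```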
### Output Separation
- **Events**: Always go to stdout (default) or the file specified by `-output`
- **Bloom Filter**: Always goes to stderr, allowing separate redirection
- **Logs**: Runtime information and progress updates go to stderr
This separation allows flexible output handling:
```bash
# Events to file, bloom filter visible in terminal