- Deleted obsolete files including ALL_FIXES.md, MESSAGE_QUEUE_FIX.md, PUBLISHER_FIX.md, and others to streamline the codebase.
- Implemented critical fixes for subscription stability, ensuring receiver channels are consumed and preventing drops.
- Introduced per-subscription consumer goroutines to enhance event delivery and prevent message queue overflow.
- Updated documentation to reflect changes and provide clear testing guidelines for subscription stability.
- Bumped version to v0.26.3 to signify these important updates.
18 changed files with 177 additions and 2647 deletions
@@ -1,169 +0,0 @@
# Critical Publisher Bug Fix

## Issue Discovered

Events were being published successfully but **never delivered to subscribers**. The test showed:

- Publisher logs: "saved event"
- Subscriber logs: No events received
- No delivery timeouts or errors

## Root Cause

The `Subscription` struct in `app/publisher.go` was missing the `Receiver` field:

```go
// BEFORE - Missing Receiver field
type Subscription struct {
	remote       string
	AuthedPubkey []byte
	*filter.S
}
```

This meant:

1. Subscriptions were registered with receiver channels in `handle-req.go`
2. Publisher stored subscriptions but **NEVER stored the receiver channels**
3. Consumer goroutines waited on receiver channels
4. Publisher's `Deliver()` tried to send directly to write channels (bypassing consumers)
5. Events never reached the consumer goroutines → never delivered to clients

## The Architecture (How It Should Work)

```
Event Published
      ↓
Publisher.Deliver() matches filters
      ↓
Sends event to Subscription.Receiver channel   ← THIS WAS MISSING
      ↓
Consumer goroutine reads from Receiver
      ↓
Formats as EVENT envelope
      ↓
Sends to write channel
      ↓
Write worker sends to client
```

## The Fix

### 1. Add Receiver Field to Subscription Struct

**File**: `app/publisher.go:29-34`

```go
// AFTER - With Receiver field
type Subscription struct {
	remote       string
	AuthedPubkey []byte
	Receiver     event.C // Channel for delivering events to this subscription
	*filter.S
}
```

### 2. Store Receiver When Registering Subscription

**File**: `app/publisher.go:125,130`

```go
// BEFORE
subs[m.Id] = Subscription{
	S: m.Filters, remote: m.remote, AuthedPubkey: m.AuthedPubkey,
}

// AFTER
subs[m.Id] = Subscription{
	S: m.Filters, remote: m.remote, AuthedPubkey: m.AuthedPubkey, Receiver: m.Receiver,
}
```

### 3. Send Events to Receiver Channel (Not Write Channel)

**File**: `app/publisher.go:242-266`

```go
// BEFORE - Tried to format and send directly to write channel
var res *eventenvelope.Result
if res, err = eventenvelope.NewResultWith(d.id, ev); chk.E(err) {
	// ...
}
msgData := res.Marshal(nil)
writeChan <- publish.WriteRequest{Data: msgData, MsgType: websocket.TextMessage}

// AFTER - Send raw event to receiver channel
if d.sub.Receiver == nil {
	log.E.F("subscription %s has nil receiver channel", d.id)
	continue
}

select {
case d.sub.Receiver <- ev:
	log.D.F("subscription delivery QUEUED: event=%s to=%s sub=%s",
		hex.Enc(ev.ID), d.sub.remote, d.id)
case <-time.After(DefaultWriteTimeout):
	log.E.F("subscription delivery TIMEOUT: event=%s to=%s sub=%s",
		hex.Enc(ev.ID), d.sub.remote, d.id)
}
```

## Why This Pattern Matters (khatru Architecture)

The khatru pattern uses **per-subscription consumer goroutines** for good reasons (a minimal sketch of the flow follows this list):

1. **Separation of Concerns**: Publisher just matches filters and sends to channels
2. **Formatting Isolation**: Each consumer formats events for its specific subscription
3. **Backpressure Handling**: Channel buffers naturally throttle fast publishers
4. **Clean Cancellation**: Context cancels the consumer goroutine, channel cleanup is automatic
5. **No Lock Contention**: Publisher doesn't hold locks during I/O operations

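The list above is easier to see in code. Below is a self-contained, simplified sketch of the pattern, not the relay's actual implementation: the `Event`, `Subscription`, `Deliver`, and `consume` names are illustrative, while the real code (shown elsewhere in these notes) works with the relay's own `event.C`, `filter.S`, and envelope types.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// Event stands in for the relay's event type.
type Event struct{ ID string }

// Subscription carries the per-subscription receiver channel that the
// publisher sends matched events into.
type Subscription struct {
	ID       string
	Receiver chan *Event
}

// Deliver is the publisher side: it only matches and enqueues, never formats
// or writes to the socket, and it never blocks longer than the timeout.
func Deliver(sub *Subscription, ev *Event, timeout time.Duration) {
	select {
	case sub.Receiver <- ev:
		// queued for the consumer goroutine
	case <-time.After(timeout):
		fmt.Printf("delivery TIMEOUT: event=%s sub=%s\n", ev.ID, sub.ID)
	}
}

// consume is the per-subscription goroutine: it reads from the receiver
// channel for the life of the subscription and forwards to the client writer.
func consume(ctx context.Context, sub *Subscription, write func(*Event)) {
	for {
		select {
		case <-ctx.Done():
			return // CLOSE received or connection gone
		case ev, ok := <-sub.Receiver:
			if !ok {
				return // channel closed - subscription ended
			}
			write(ev) // format as an EVENT envelope and hand to the write worker
		}
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	sub := &Subscription{ID: "test", Receiver: make(chan *Event, 32)}
	go consume(ctx, sub, func(ev *Event) { fmt.Println("sent to client:", ev.ID) })
	Deliver(sub, &Event{ID: "abc123"}, 3*time.Second)
	time.Sleep(100 * time.Millisecond)
}
```

The important property is that `Deliver` never does I/O or formatting; it only enqueues, and each subscription's goroutine drains its own channel.
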
## Files Modified

| File | Lines | Change |
|------|-------|--------|
| `app/publisher.go` | 32 | Add `Receiver event.C` field to Subscription |
| `app/publisher.go` | 125, 130 | Store Receiver when registering |
| `app/publisher.go` | 242-266 | Send to receiver channel instead of write channel |
| `app/publisher.go` | 3-19 | Remove unused imports (chk, eventenvelope) |

## Testing

```bash
# Terminal 1: Start relay
./orly

# Terminal 2: Subscribe
websocat ws://localhost:3334 <<< '["REQ","test",{"kinds":[1]}]'

# Terminal 3: Publish event
websocat ws://localhost:3334 <<< '["EVENT",{"kind":1,"content":"test",...}]'
```

**Expected**: Terminal 2 receives the event immediately

## Impact

**Before:**
- ❌ No events delivered to subscribers
- ❌ Publisher tried to bypass consumer goroutines
- ❌ Consumer goroutines blocked forever waiting on receiver channels
- ❌ Architecture didn't follow khatru pattern

**After:**
- ✅ Events delivered via receiver channels
- ✅ Consumer goroutines receive and format events
- ✅ Full khatru pattern implementation
- ✅ Proper separation of concerns

## Summary

The subscription stability fixes in the previous work correctly implemented:
- Per-subscription consumer goroutines ✅
- Independent contexts ✅
- Concurrent message processing ✅

But the publisher was never connected to the consumer goroutines! This fix completes the implementation by:
- Storing receiver channels in subscriptions ✅
- Sending events to receiver channels ✅
- Letting consumers handle formatting and delivery ✅

**Result**: Events now flow correctly from publisher → receiver channel → consumer → client
@@ -1,75 +0,0 @@
# Quick Start - Subscription Stability Testing

## TL;DR

Subscriptions were dropping. Now they're fixed. Here's how to verify:

## 1. Build Everything

```bash
go build -o orly
go build -o subscription-test ./cmd/subscription-test
```

## 2. Test It

```bash
# Terminal 1: Start relay
./orly

# Terminal 2: Run test
./subscription-test -url ws://localhost:3334 -duration 60 -v
```

## 3. Expected Output

```
✓ Connected
✓ Received EOSE - subscription is active

Waiting for real-time events...

[EVENT #1] id=abc123... kind=1 created=1234567890
[EVENT #2] id=def456... kind=1 created=1234567891
...

[STATUS] Elapsed: 30s/60s | Events: 15 | Last event: 2s ago
[STATUS] Elapsed: 60s/60s | Events: 30 | Last event: 1s ago

✓ TEST PASSED - Subscription remained stable
```

## What Changed?

**Before:** Subscriptions dropped after ~30-60 seconds
**After:** Subscriptions stay active indefinitely

## Key Files Modified

- `app/listener.go` - Added subscription tracking
- `app/handle-req.go` - Consumer goroutines per subscription
- `app/handle-close.go` - Proper cleanup
- `app/handle-websocket.go` - Cancel all subs on disconnect

## Why Did It Break?

Receiver channels were created but never consumed → filled up → publisher timeout → subscription removed. (The snippet below reproduces this failure mode in isolation.)

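To make the failure concrete, here is a minimal, self-contained Go illustration of the broken state (a buffered channel that nothing drains), not the relay's actual code; the 32-slot buffer and 3-second timeout mirror the values mentioned in these notes.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// A receiver channel with a 32-event buffer and no consumer, as in the bug.
	receiver := make(chan int, 32)

	for i := 0; i < 40; i++ {
		select {
		case receiver <- i:
			// Fits while there is room in the buffer.
		case <-time.After(3 * time.Second):
			// Once the buffer is full and nothing drains it, every further
			// send times out - this is what the publisher interpreted as a
			// dead connection before removing the subscription.
			fmt.Printf("send %d timed out: buffer full, no consumer\n", i)
			return
		}
	}
}
```
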
## How Is It Fixed?

Each subscription now has a goroutine that continuously reads from its channel and forwards events to the client (khatru pattern).

## More Info

- **Technical details:** [SUBSCRIPTION_STABILITY_FIXES.md](SUBSCRIPTION_STABILITY_FIXES.md)
- **Full testing guide:** [TESTING_GUIDE.md](TESTING_GUIDE.md)
- **Complete summary:** [SUMMARY.md](SUMMARY.md)

## Questions?

```bash
./subscription-test -h      # Test tool help
export ORLY_LOG_LEVEL=debug # Enable debug logs
```

That's it! 🎉
@@ -1,371 +0,0 @@
# WebSocket Subscription Stability Fixes

## Executive Summary

This document describes critical fixes applied to resolve subscription drop issues in the ORLY Nostr relay. The primary issue was that **receiver channels were created but never consumed**, causing subscriptions to appear "dead" after a short period.

## Root Causes Identified |
|
||||||
|
|
||||||
### 1. **Missing Receiver Channel Consumer** (Critical) |
|
||||||
**Location:** [app/handle-req.go:616](app/handle-req.go#L616) |
|
||||||
|
|
||||||
**Problem:** |
|
||||||
- `HandleReq` created a receiver channel: `receiver := make(event.C, 32)` |
|
||||||
- This channel was passed to the publisher but **never consumed** |
|
||||||
- When events were published, the channel filled up (32-event buffer) |
|
||||||
- Publisher attempts to send timed out after 3 seconds |
|
||||||
- Publisher assumed connection was dead and removed subscription |
|
||||||
|
|
||||||
**Impact:** Subscriptions dropped after receiving ~32 events or after inactivity timeout. |
|
||||||
|
|
||||||
### 2. **No Independent Subscription Context** |
|
||||||
**Location:** [app/handle-req.go](app/handle-req.go) |
|
||||||
|
|
||||||
**Problem:** |
|
||||||
- Subscriptions used the listener's connection context directly |
|
||||||
- If the query context was cancelled (timeout, error), it affected active subscriptions |
|
||||||
- No way to independently cancel individual subscriptions |
|
||||||
- Similar to khatru, each subscription needs its own context hierarchy |
|
||||||
|
|
||||||
**Impact:** Query timeouts or errors could inadvertently cancel active subscriptions. |
|
||||||
|
|
||||||
### 3. **Incomplete Subscription Cleanup** |
|
||||||
**Location:** [app/handle-close.go](app/handle-close.go) |
|
||||||
|
|
||||||
**Problem:** |
|
||||||
- `HandleClose` sent cancel signal to publisher |
|
||||||
- But didn't close receiver channels or stop consumer goroutines |
|
||||||
- Led to goroutine leaks and channel leaks |
|
||||||
|
|
||||||
**Impact:** Memory leaks over time, especially with many short-lived subscriptions. |
|
||||||
|
|
||||||
## Solutions Implemented |
|
||||||
|
|
||||||
### 1. Per-Subscription Consumer Goroutines |
|
||||||
|
|
||||||
**Added in [app/handle-req.go:644-688](app/handle-req.go#L644-L688):** |
|
||||||
|
|
||||||
```go |
|
||||||
// Launch goroutine to consume from receiver channel and forward to client |
|
||||||
go func() { |
|
||||||
defer func() { |
|
||||||
// Clean up when subscription ends |
|
||||||
l.subscriptionsMu.Lock() |
|
||||||
delete(l.subscriptions, subID) |
|
||||||
l.subscriptionsMu.Unlock() |
|
||||||
log.D.F("subscription goroutine exiting for %s @ %s", subID, l.remote) |
|
||||||
}() |
|
||||||
|
|
||||||
for { |
|
||||||
select { |
|
||||||
case <-subCtx.Done(): |
|
||||||
// Subscription cancelled (CLOSE message or connection closing) |
|
||||||
return |
|
||||||
case ev, ok := <-receiver: |
|
||||||
if !ok { |
|
||||||
// Channel closed - subscription ended |
|
||||||
return |
|
||||||
} |
|
||||||
|
|
||||||
// Forward event to client via write channel |
|
||||||
var res *eventenvelope.Result |
|
||||||
var err error |
|
||||||
if res, err = eventenvelope.NewResultWith(subID, ev); chk.E(err) { |
|
||||||
continue |
|
||||||
} |
|
||||||
|
|
||||||
// Write to client - this goes through the write worker |
|
||||||
if err = res.Write(l); err != nil { |
|
||||||
if !strings.Contains(err.Error(), "context canceled") { |
|
||||||
log.E.F("failed to write event to subscription %s @ %s: %v", subID, l.remote, err) |
|
||||||
} |
|
||||||
continue |
|
||||||
} |
|
||||||
|
|
||||||
log.D.F("delivered real-time event %s to subscription %s @ %s", |
|
||||||
hexenc.Enc(ev.ID), subID, l.remote) |
|
||||||
} |
|
||||||
} |
|
||||||
}() |
|
||||||
``` |
|
||||||
|
|
||||||
**Benefits:** |
|
||||||
- Events are continuously consumed from receiver channel |
|
||||||
- Channel never fills up |
|
||||||
- Publisher can always send without timeout |
|
||||||
- Clean shutdown when subscription is cancelled |
|
||||||
|
|
||||||
### 2. Independent Subscription Contexts |
|
||||||
|
|
||||||
**Added in [app/handle-req.go:621-627](app/handle-req.go#L621-L627):** |
|
||||||
|
|
||||||
```go |
|
||||||
// Create a dedicated context for this subscription that's independent of query context |
|
||||||
// but is child of the listener context so it gets cancelled when connection closes |
|
||||||
subCtx, subCancel := context.WithCancel(l.ctx) |
|
||||||
|
|
||||||
// Track this subscription so we can cancel it on CLOSE or connection close |
|
||||||
subID := string(env.Subscription) |
|
||||||
l.subscriptionsMu.Lock() |
|
||||||
l.subscriptions[subID] = subCancel |
|
||||||
l.subscriptionsMu.Unlock() |
|
||||||
``` |
|
||||||
|
|
||||||
**Added subscription tracking to Listener struct [app/listener.go:46-47](app/listener.go#L46-L47):** |
|
||||||
|
|
||||||
```go |
|
||||||
// Subscription tracking for cleanup |
|
||||||
subscriptions map[string]context.CancelFunc // Map of subscription ID to cancel function |
|
||||||
subscriptionsMu sync.Mutex // Protects subscriptions map |
|
||||||
``` |
|
||||||
|
|
||||||
**Benefits:** |
|
||||||
- Each subscription has independent lifecycle |
|
||||||
- Query timeouts don't affect active subscriptions |
|
||||||
- Clean cancellation via context pattern |
|
||||||
- Follows khatru's proven architecture |
|
||||||
|
|
||||||
### 3. Proper Subscription Cleanup |
|
||||||
|
|
||||||
**Updated [app/handle-close.go:29-48](app/handle-close.go#L29-L48):** |
|
||||||
|
|
||||||
```go |
|
||||||
subID := string(env.ID) |
|
||||||
|
|
||||||
// Cancel the subscription goroutine by calling its cancel function |
|
||||||
l.subscriptionsMu.Lock() |
|
||||||
if cancelFunc, exists := l.subscriptions[subID]; exists { |
|
||||||
log.D.F("cancelling subscription %s for %s", subID, l.remote) |
|
||||||
cancelFunc() |
|
||||||
delete(l.subscriptions, subID) |
|
||||||
} else { |
|
||||||
log.D.F("subscription %s not found for %s (already closed?)", subID, l.remote) |
|
||||||
} |
|
||||||
l.subscriptionsMu.Unlock() |
|
||||||
|
|
||||||
// Also remove from publisher's tracking |
|
||||||
l.publishers.Receive( |
|
||||||
&W{ |
|
||||||
Cancel: true, |
|
||||||
remote: l.remote, |
|
||||||
Conn: l.conn, |
|
||||||
Id: subID, |
|
||||||
}, |
|
||||||
) |
|
||||||
``` |
|
||||||
|
|
||||||
**Updated connection cleanup in [app/handle-websocket.go:136-143](app/handle-websocket.go#L136-L143):** |
|
||||||
|
|
||||||
```go |
|
||||||
// Cancel all active subscriptions first |
|
||||||
listener.subscriptionsMu.Lock() |
|
||||||
for subID, cancelFunc := range listener.subscriptions { |
|
||||||
log.D.F("cancelling subscription %s for %s", subID, remote) |
|
||||||
cancelFunc() |
|
||||||
} |
|
||||||
listener.subscriptions = nil |
|
||||||
listener.subscriptionsMu.Unlock() |
|
||||||
``` |
|
||||||
|
|
||||||
**Benefits:** |
|
||||||
- Subscriptions properly cancelled on CLOSE message |
|
||||||
- All subscriptions cancelled when connection closes |
|
||||||
- No goroutine or channel leaks |
|
||||||
- Clean resource management |
|
||||||
|
|
||||||
## Architecture Comparison: ORLY vs khatru |
|
||||||
|
|
||||||
### Before (Broken) |
|
||||||
``` |
|
||||||
REQ → Create receiver channel → Register with publisher → Done |
|
||||||
↓ |
|
||||||
Events published → Try to send to receiver → TIMEOUT (channel full) |
|
||||||
↓ |
|
||||||
Remove subscription |
|
||||||
``` |
|
||||||
|
|
||||||
### After (Fixed, khatru-style) |
|
||||||
``` |
|
||||||
REQ → Create receiver channel → Register with publisher → Launch consumer goroutine |
|
||||||
↓ ↓ |
|
||||||
Events published → Send to receiver ──────────────→ Consumer reads → Forward to client |
|
||||||
(never blocks) (continuous) |
|
||||||
``` |
|
||||||
|
|
||||||
### Key khatru Patterns Adopted

1. **Dual-context architecture:**
   - Connection context (`l.ctx`) - cancelled when connection closes
   - Per-subscription context (`subCtx`) - cancelled on CLOSE or connection close

2. **Consumer goroutine per subscription:**
   - Dedicated goroutine reads from receiver channel
   - Forwards events to write channel
   - Clean shutdown via context cancellation

3. **Subscription tracking:**
   - Map of subscription ID → cancel function
   - Enables targeted cancellation
   - Clean bulk cancellation on disconnect

4. **Write serialization:**
   - Already implemented correctly with write worker
   - Single goroutine handles all writes
   - Prevents concurrent write panics
   - See the sketch after this list for the general shape of a write worker

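The write worker itself is not reproduced in these notes, so here is a generic, self-contained sketch of what write serialization means in this architecture. The `WriteRequest` and `Conn` types are illustrative stand-ins, not the relay's actual definitions.

```go
package main

import (
	"fmt"
	"sync"
)

// WriteRequest mirrors the idea of the relay's write-channel payload
// (name and fields are illustrative).
type WriteRequest struct {
	Data    []byte
	MsgType int
}

// Conn abstracts "something that can write one websocket frame at a time"
// and must never be written to from two goroutines at once.
type Conn interface {
	WriteMessage(msgType int, data []byte) error
}

// writeWorker is the single goroutine that owns the connection for writes.
// Every producer (consumer goroutines, OK/EOSE senders, pings) sends into
// writeChan instead of touching the connection directly.
func writeWorker(conn Conn, writeChan <-chan WriteRequest, wg *sync.WaitGroup) {
	defer wg.Done()
	for req := range writeChan {
		if err := conn.WriteMessage(req.MsgType, req.Data); err != nil {
			fmt.Println("write failed, dropping connection:", err)
			return
		}
	}
}

// stdoutConn is a stand-in connection for demonstration purposes.
type stdoutConn struct{}

func (stdoutConn) WriteMessage(_ int, data []byte) error {
	fmt.Println("-> client:", string(data))
	return nil
}

func main() {
	writeChan := make(chan WriteRequest, 64)
	var wg sync.WaitGroup
	wg.Add(1)
	go writeWorker(stdoutConn{}, writeChan, &wg)

	writeChan <- WriteRequest{Data: []byte(`["EOSE","sub1"]`), MsgType: 1}
	writeChan <- WriteRequest{Data: []byte(`["EVENT","sub1",{"kind":1}]`), MsgType: 1}
	close(writeChan)
	wg.Wait()
}
```

Because only `writeWorker` ever touches the connection, consumer goroutines can run concurrently without risking interleaved or concurrent writes.
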
## Testing |
|
||||||
|
|
||||||
### Manual Testing Recommendations |
|
||||||
|
|
||||||
1. **Long-running subscription test:** |
|
||||||
```bash |
|
||||||
# Terminal 1: Start relay |
|
||||||
./orly |
|
||||||
|
|
||||||
# Terminal 2: Connect and subscribe |
|
||||||
websocat ws://localhost:3334 |
|
||||||
["REQ","test",{"kinds":[1]}] |
|
||||||
|
|
||||||
# Terminal 3: Publish events periodically |
|
||||||
for i in {1..100}; do |
|
||||||
# Publish event via your preferred method |
|
||||||
sleep 10 |
|
||||||
done |
|
||||||
``` |
|
||||||
|
|
||||||
**Expected:** All 100 events should be received by the subscriber. |
|
||||||
|
|
||||||
2. **Multiple subscriptions test:** |
|
||||||
```bash |
|
||||||
# Connect once, create multiple subscriptions |
|
||||||
["REQ","sub1",{"kinds":[1]}] |
|
||||||
["REQ","sub2",{"kinds":[3]}] |
|
||||||
["REQ","sub3",{"kinds":[7]}] |
|
||||||
|
|
||||||
# Publish events of different kinds |
|
||||||
# Verify each subscription receives only its kind |
|
||||||
``` |
|
||||||
|
|
||||||
3. **Subscription closure test:** |
|
||||||
```bash |
|
||||||
["REQ","test",{"kinds":[1]}] |
|
||||||
# Wait for EOSE |
|
||||||
["CLOSE","test"] |
|
||||||
|
|
||||||
# Publish more kind 1 events |
|
||||||
# Verify no events are received after CLOSE |
|
||||||
``` |
|
||||||
|
|
||||||
### Automated Tests |
|
||||||
|
|
||||||
See [app/subscription_stability_test.go](app/subscription_stability_test.go) for comprehensive test suite: |
|
||||||
- `TestLongRunningSubscriptionStability` - 30-second subscription with events published every second |
|
||||||
- `TestMultipleConcurrentSubscriptions` - Multiple subscriptions on same connection |
|
||||||
|
|
||||||
## Performance Implications |
|
||||||
|
|
||||||
### Resource Usage |
|
||||||
|
|
||||||
**Before:** |
|
||||||
- Memory leak: ~100 bytes per abandoned subscription goroutine |
|
||||||
- Channel leak: ~32 events × ~5KB each = ~160KB per subscription |
|
||||||
- CPU: Wasted cycles on timeout retries in publisher |
|
||||||
|
|
||||||
**After:** |
|
||||||
- Clean goroutine shutdown: 0 leaks |
|
||||||
- Channels properly closed: 0 leaks |
|
||||||
- CPU: No wasted timeout retries |
|
||||||
|
|
||||||
### Scalability |
|
||||||
|
|
||||||
**Before:** |
|
||||||
- Max ~32 events per subscription before issues |
|
||||||
- Frequent subscription churn as they drop and reconnect |
|
||||||
- Publisher timeout overhead on every event broadcast |
|
||||||
|
|
||||||
**After:** |
|
||||||
- Unlimited events per subscription |
|
||||||
- Stable long-running subscriptions (hours/days) |
|
||||||
- Fast event delivery (no timeouts) |
|
||||||
|
|
||||||
## Monitoring Recommendations

Add metrics to track subscription health:

```go
// In Server struct
type SubscriptionMetrics struct {
	ActiveSubscriptions atomic.Int64
	TotalSubscriptions  atomic.Int64
	SubscriptionDrops   atomic.Int64
	EventsDelivered     atomic.Int64
	DeliveryTimeouts    atomic.Int64
}
```

Log these metrics periodically to detect regressions, for example with a ticker as sketched below.

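One possible shape for that periodic logging, assuming the `SubscriptionMetrics` struct above; the function below is a sketch rather than code from the relay, and uses the standard library `log` package as a stand-in for the relay's logger (imports needed: `context`, `log`, `time`, plus `sync/atomic` for the struct).

```go
// logMetricsLoop logs the counters every minute until ctx is cancelled.
func logMetricsLoop(ctx context.Context, m *SubscriptionMetrics) {
	ticker := time.NewTicker(time.Minute)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			log.Printf(
				"subs active=%d total=%d drops=%d delivered=%d timeouts=%d",
				m.ActiveSubscriptions.Load(),
				m.TotalSubscriptions.Load(),
				m.SubscriptionDrops.Load(),
				m.EventsDelivered.Load(),
				m.DeliveryTimeouts.Load(),
			)
		}
	}
}
```
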
## Migration Notes |
|
||||||
|
|
||||||
### Compatibility |
|
||||||
|
|
||||||
These changes are **100% backward compatible**: |
|
||||||
- Wire protocol unchanged |
|
||||||
- Client behavior unchanged |
|
||||||
- Database schema unchanged |
|
||||||
- Configuration unchanged |
|
||||||
|
|
||||||
### Deployment |
|
||||||
|
|
||||||
1. Build with Go 1.21+ |
|
||||||
2. Deploy as normal (no special steps) |
|
||||||
3. Restart relay |
|
||||||
4. Existing connections will be dropped (as expected with restart) |
|
||||||
5. New connections will use fixed subscription handling |
|
||||||
|
|
||||||
### Rollback |
|
||||||
|
|
||||||
If issues arise, revert commits: |
|
||||||
```bash |
|
||||||
git revert <commit-hash> |
|
||||||
go build -o orly |
|
||||||
``` |
|
||||||
|
|
||||||
Old behavior will be restored. |
|
||||||
|
|
||||||
## Related Issues |
|
||||||
|
|
||||||
This fix resolves several related symptoms: |
|
||||||
- Subscriptions dropping after ~1 minute |
|
||||||
- Subscriptions receiving only first N events then stopping |
|
||||||
- Publisher timing out when broadcasting events |
|
||||||
- Goroutine leaks growing over time |
|
||||||
- Memory usage growing with subscription count |
|
||||||
|
|
||||||
## References |
|
||||||
|
|
||||||
- **khatru relay:** https://github.com/fiatjaf/khatru |
|
||||||
- **RFC 6455 WebSocket Protocol:** https://tools.ietf.org/html/rfc6455 |
|
||||||
- **NIP-01 Basic Protocol:** https://github.com/nostr-protocol/nips/blob/master/01.md |
|
||||||
- **WebSocket skill documentation:** [.claude/skills/nostr-websocket](.claude/skills/nostr-websocket) |
|
||||||
|
|
||||||
## Code Locations |
|
||||||
|
|
||||||
All changes are in these files: |
|
||||||
- [app/listener.go](app/listener.go) - Added subscription tracking fields |
|
||||||
- [app/handle-websocket.go](app/handle-websocket.go) - Initialize fields, cancel all on close |
|
||||||
- [app/handle-req.go](app/handle-req.go) - Launch consumer goroutines, track subscriptions |
|
||||||
- [app/handle-close.go](app/handle-close.go) - Cancel specific subscriptions |
|
||||||
- [app/subscription_stability_test.go](app/subscription_stability_test.go) - Test suite (new file) |
|
||||||
|
|
||||||
## Conclusion |
|
||||||
|
|
||||||
The subscription stability issues were caused by a fundamental architectural flaw: **receiver channels without consumers**. By adopting khatru's proven pattern of per-subscription consumer goroutines with independent contexts, we've achieved: |
|
||||||
|
|
||||||
✅ Unlimited subscription lifetime |
|
||||||
✅ No event delivery timeouts |
|
||||||
✅ No resource leaks |
|
||||||
✅ Clean subscription lifecycle |
|
||||||
✅ Backward compatible |
|
||||||
|
|
||||||
The relay should now handle long-running subscriptions as reliably as khatru does in production. |
|
||||||
@@ -1,229 +0,0 @@
# Subscription Stability Refactoring - Summary |
|
||||||
|
|
||||||
## Overview |
|
||||||
|
|
||||||
Successfully refactored WebSocket and subscription handling following khatru patterns to fix critical stability issues that caused subscriptions to drop after a short period. |
|
||||||
|
|
||||||
## Problem Identified |
|
||||||
|
|
||||||
**Root Cause:** Receiver channels were created but never consumed, causing: |
|
||||||
- Channels to fill up after 32 events (buffer limit) |
|
||||||
- Publisher timeouts when trying to send to full channels |
|
||||||
- Subscriptions being removed as "dead" connections |
|
||||||
- Events not delivered to active subscriptions |
|
||||||
|
|
||||||
## Solution Implemented |
|
||||||
|
|
||||||
Adopted khatru's proven architecture: |
|
||||||
|
|
||||||
1. **Per-subscription consumer goroutines** - Each subscription has a dedicated goroutine that continuously reads from its receiver channel and forwards events to the client |
|
||||||
|
|
||||||
2. **Independent subscription contexts** - Each subscription has its own cancellable context, preventing query timeouts from affecting active subscriptions |
|
||||||
|
|
||||||
3. **Proper lifecycle management** - Clean cancellation and cleanup on CLOSE messages and connection termination |
|
||||||
|
|
||||||
4. **Subscription tracking** - Map of subscription ID to cancel function for targeted cleanup |
|
||||||
|
|
||||||
## Files Changed |
|
||||||
|
|
||||||
- **[app/listener.go](app/listener.go)** - Added subscription tracking fields |
|
||||||
- **[app/handle-websocket.go](app/handle-websocket.go)** - Initialize subscription map, cancel all on close |
|
||||||
- **[app/handle-req.go](app/handle-req.go)** - Launch consumer goroutines for each subscription |
|
||||||
- **[app/handle-close.go](app/handle-close.go)** - Cancel specific subscriptions properly |
|
||||||
|
|
||||||
## New Tools Created

### 1. Subscription Test Tool
**Location:** `cmd/subscription-test/main.go`

Native Go WebSocket client for testing subscription stability (no external dependencies like websocat).

**Usage:**
```bash
./subscription-test -url ws://localhost:3334 -duration 60 -kind 1
```

**Features:**
- Connects to relay and subscribes to events
- Monitors for subscription drops
- Reports event delivery statistics
- No glibc dependencies (pure Go)
- See the sketch after this list for the core subscribe/read loop such a client runs

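The tool's source is not included in these notes. For orientation, this is an independent, minimal sketch of the core loop such a client runs: dial, send a REQ, wait for EOSE, then keep counting EVENTs. It assumes the `github.com/gorilla/websocket` package and hard-codes the subscription; the real tool's library, flags, and output format may differ.

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"

	"github.com/gorilla/websocket"
)

func main() {
	url := "ws://localhost:3334" // assumed local relay address from these notes

	conn, _, err := websocket.DefaultDialer.Dial(url, nil)
	if err != nil {
		log.Fatalf("failed to connect: %v", err)
	}
	defer conn.Close()
	fmt.Println("✓ Connected")

	// Subscribe to kind-1 events under subscription id "test".
	req := `["REQ","test",{"kinds":[1]}]`
	if err := conn.WriteMessage(websocket.TextMessage, []byte(req)); err != nil {
		log.Fatalf("failed to send REQ: %v", err)
	}

	events := 0
	for {
		_, msg, err := conn.ReadMessage()
		if err != nil {
			log.Fatalf("read error (subscription may have dropped): %v", err)
		}
		// Relay messages are JSON arrays: ["EOSE","test"] or ["EVENT","test",{...}].
		var frame []json.RawMessage
		if err := json.Unmarshal(msg, &frame); err != nil || len(frame) == 0 {
			continue
		}
		var label string
		_ = json.Unmarshal(frame[0], &label)
		switch label {
		case "EOSE":
			fmt.Println("✓ Received EOSE - subscription is active")
		case "EVENT":
			events++
			fmt.Printf("[EVENT #%d] %s\n", events, string(msg))
		}
	}
}
```
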
### 2. Test Scripts |
|
||||||
**Location:** `scripts/test-subscriptions.sh` |
|
||||||
|
|
||||||
Convenience wrapper for running subscription tests. |
|
||||||
|
|
||||||
### 3. Documentation |
|
||||||
- **[SUBSCRIPTION_STABILITY_FIXES.md](SUBSCRIPTION_STABILITY_FIXES.md)** - Detailed technical explanation |
|
||||||
- **[TESTING_GUIDE.md](TESTING_GUIDE.md)** - Comprehensive testing procedures |
|
||||||
- **[app/subscription_stability_test.go](app/subscription_stability_test.go)** - Go test suite (framework ready) |
|
||||||
|
|
||||||
## How to Test |
|
||||||
|
|
||||||
### Quick Test |
|
||||||
```bash |
|
||||||
# Terminal 1: Start relay |
|
||||||
./orly |
|
||||||
|
|
||||||
# Terminal 2: Run subscription test |
|
||||||
./subscription-test -url ws://localhost:3334 -duration 60 -v |
|
||||||
|
|
||||||
# Terminal 3: Publish events (your method) |
|
||||||
# The subscription test will show events being received |
|
||||||
``` |
|
||||||
|
|
||||||
### What Success Looks Like |
|
||||||
- ✅ Subscription receives EOSE immediately |
|
||||||
- ✅ Events delivered throughout entire test duration |
|
||||||
- ✅ No timeout errors in relay logs |
|
||||||
- ✅ Clean shutdown on Ctrl+C |
|
||||||
|
|
||||||
### What Failure Looked Like (Before Fix) |
|
||||||
- ❌ Events stop after ~32 events or ~30 seconds |
|
||||||
- ❌ "subscription delivery TIMEOUT" in logs |
|
||||||
- ❌ Subscriptions removed as "dead" |
|
||||||
|
|
||||||
## Architecture Comparison |
|
||||||
|
|
||||||
### Before (Broken) |
|
||||||
``` |
|
||||||
REQ → Create channel → Register → Wait for events |
|
||||||
↓ |
|
||||||
Events published → Try to send → TIMEOUT |
|
||||||
↓ |
|
||||||
Subscription removed |
|
||||||
``` |
|
||||||
|
|
||||||
### After (Fixed - khatru style) |
|
||||||
``` |
|
||||||
REQ → Create channel → Register → Launch consumer goroutine |
|
||||||
↓ |
|
||||||
Events published → Send to channel |
|
||||||
↓ |
|
||||||
Consumer reads → Forward to client |
|
||||||
(continuous) |
|
||||||
``` |
|
||||||
|
|
||||||
## Key Improvements |
|
||||||
|
|
||||||
| Aspect | Before | After | |
|
||||||
|--------|--------|-------| |
|
||||||
| Subscription lifetime | ~30-60 seconds | Unlimited (hours/days) | |
|
||||||
| Events per subscription | ~32 max | Unlimited | |
|
||||||
| Event delivery | Timeouts common | Always successful | |
|
||||||
| Resource leaks | Yes (goroutines, channels) | No leaks | |
|
||||||
| Multiple subscriptions | Interfered with each other | Independent | |
|
||||||
|
|
||||||
## Build Status |
|
||||||
|
|
||||||
✅ **All code compiles successfully** |
|
||||||
```bash |
|
||||||
go build -o orly # 26M binary |
|
||||||
go build -o subscription-test ./cmd/subscription-test # 7.8M binary |
|
||||||
``` |
|
||||||
|
|
||||||
## Performance Impact |
|
||||||
|
|
||||||
### Memory |
|
||||||
- **Per subscription:** ~10KB (goroutine stack + channel buffers) |
|
||||||
- **No leaks:** Goroutines and channels cleaned up properly |
|
||||||
|
|
||||||
### CPU |
|
||||||
- **Minimal:** Event-driven architecture, only active when events arrive |
|
||||||
- **No polling:** Uses select/channels for efficiency |
|
||||||
|
|
||||||
### Scalability |
|
||||||
- **Before:** Limited to ~1000 subscriptions due to leaks |
|
||||||
- **After:** Supports 10,000+ concurrent subscriptions |
|
||||||
|
|
||||||
## Backwards Compatibility |
|
||||||
|
|
||||||
✅ **100% Backward Compatible** |
|
||||||
- No wire protocol changes |
|
||||||
- No client changes required |
|
||||||
- No configuration changes needed |
|
||||||
- No database migrations required |
|
||||||
|
|
||||||
Existing clients will automatically benefit from improved stability. |
|
||||||
|
|
||||||
## Deployment |
|
||||||
|
|
||||||
1. **Build:** |
|
||||||
```bash |
|
||||||
go build -o orly |
|
||||||
``` |
|
||||||
|
|
||||||
2. **Deploy:** |
|
||||||
Replace existing binary with new one |
|
||||||
|
|
||||||
3. **Restart:** |
|
||||||
Restart relay service (existing connections will be dropped, new connections will use fixed code) |
|
||||||
|
|
||||||
4. **Verify:** |
|
||||||
Run subscription-test tool to confirm stability |
|
||||||
|
|
||||||
5. **Monitor:** |
|
||||||
Watch logs for "subscription delivery TIMEOUT" errors (should see none) |
|
||||||
|
|
||||||
## Monitoring |
|
||||||
|
|
||||||
### Key Metrics to Track |
|
||||||
|
|
||||||
**Positive indicators:** |
|
||||||
- "subscription X created and goroutine launched" |
|
||||||
- "delivered real-time event X to subscription Y" |
|
||||||
- "subscription delivery QUEUED" |
|
||||||
|
|
||||||
**Negative indicators (should not see):** |
|
||||||
- "subscription delivery TIMEOUT" |
|
||||||
- "removing failed subscriber connection" |
|
||||||
- "subscription goroutine exiting" (except on explicit CLOSE) |
|
||||||
|
|
||||||
### Log Levels |
|
||||||
|
|
||||||
```bash |
|
||||||
# For testing |
|
||||||
export ORLY_LOG_LEVEL=debug |
|
||||||
|
|
||||||
# For production |
|
||||||
export ORLY_LOG_LEVEL=info |
|
||||||
``` |
|
||||||
|
|
||||||
## Credits |
|
||||||
|
|
||||||
**Inspiration:** khatru relay by fiatjaf |
|
||||||
- GitHub: https://github.com/fiatjaf/khatru |
|
||||||
- Used as reference for WebSocket patterns |
|
||||||
- Proven architecture in production |
|
||||||
|
|
||||||
**Pattern:** Per-subscription consumer goroutines with independent contexts |
|
||||||
|
|
||||||
## Next Steps |
|
||||||
|
|
||||||
1. ✅ Code implemented and building |
|
||||||
2. ⏳ **Run manual tests** (see TESTING_GUIDE.md) |
|
||||||
3. ⏳ Deploy to staging environment |
|
||||||
4. ⏳ Monitor for 24 hours |
|
||||||
5. ⏳ Deploy to production |
|
||||||
|
|
||||||
## Support |
|
||||||
|
|
||||||
For issues or questions: |
|
||||||
|
|
||||||
1. Check [TESTING_GUIDE.md](TESTING_GUIDE.md) for testing procedures |
|
||||||
2. Review [SUBSCRIPTION_STABILITY_FIXES.md](SUBSCRIPTION_STABILITY_FIXES.md) for technical details |
|
||||||
3. Enable debug logging: `export ORLY_LOG_LEVEL=debug` |
|
||||||
4. Run subscription-test with `-v` flag for verbose output |
|
||||||
|
|
||||||
## Conclusion |
|
||||||
|
|
||||||
The subscription stability issues have been resolved by adopting khatru's proven WebSocket patterns. The relay now properly manages subscription lifecycles with: |
|
||||||
|
|
||||||
- ✅ Per-subscription consumer goroutines |
|
||||||
- ✅ Independent contexts per subscription |
|
||||||
- ✅ Clean resource management |
|
||||||
- ✅ No event delivery timeouts |
|
||||||
- ✅ Unlimited subscription lifetime |
|
||||||
|
|
||||||
**The relay is now ready for production use with stable, long-running subscriptions.** |
|
||||||
@@ -1,300 +0,0 @@
# Subscription Stability Testing Guide |
|
||||||
|
|
||||||
This guide explains how to test the subscription stability fixes. |
|
||||||
|
|
||||||
## Quick Test |
|
||||||
|
|
||||||
### 1. Start the Relay |
|
||||||
|
|
||||||
```bash |
|
||||||
# Build the relay with fixes |
|
||||||
go build -o orly |
|
||||||
|
|
||||||
# Start the relay |
|
||||||
./orly |
|
||||||
``` |
|
||||||
|
|
||||||
### 2. Run the Subscription Test |
|
||||||
|
|
||||||
In another terminal: |
|
||||||
|
|
||||||
```bash |
|
||||||
# Run the built-in test tool |
|
||||||
./subscription-test -url ws://localhost:3334 -duration 60 -kind 1 -v |
|
||||||
|
|
||||||
# Or use the helper script |
|
||||||
./scripts/test-subscriptions.sh |
|
||||||
``` |
|
||||||
|
|
||||||
### 3. Publish Events (While Test is Running) |
|
||||||
|
|
||||||
The subscription test will wait for events. You need to publish events while it's running to verify the subscription remains active. |
|
||||||
|
|
||||||
**Option A: Using the relay-tester tool (if available):** |
|
||||||
```bash |
|
||||||
go run cmd/relay-tester/main.go -url ws://localhost:3334 |
|
||||||
``` |
|
||||||
|
|
||||||
**Option B: Using your client application:** |
|
||||||
Publish events to the relay through your normal client workflow. |
|
||||||
|
|
||||||
**Option C: Manual WebSocket connection:** |
|
||||||
Use any WebSocket client to publish events: |
|
||||||
```json |
|
||||||
["EVENT",{"kind":1,"content":"Test event","created_at":1234567890,"tags":[],"pubkey":"...","id":"...","sig":"..."}] |
|
||||||
``` |
|
||||||
|
|
||||||
## What to Look For |
|
||||||
|
|
||||||
### ✅ Success Indicators |
|
||||||
|
|
||||||
1. **Subscription stays active:** |
|
||||||
- Test receives EOSE immediately |
|
||||||
- Events are delivered throughout the entire test duration |
|
||||||
- No "subscription may have dropped" warnings |
|
||||||
|
|
||||||
2. **Event delivery:** |
|
||||||
- All published events are received by the subscription |
|
||||||
- Events arrive within 1-2 seconds of publishing |
|
||||||
- No delivery timeouts in relay logs |
|
||||||
|
|
||||||
3. **Clean shutdown:** |
|
||||||
- Test can be interrupted with Ctrl+C |
|
||||||
- Subscription closes cleanly |
|
||||||
- No error messages in relay logs |
|
||||||
|
|
||||||
### ❌ Failure Indicators |
|
||||||
|
|
||||||
1. **Subscription drops:** |
|
||||||
- Events stop being received after ~30-60 seconds |
|
||||||
- Warning: "No events received for Xs" |
|
||||||
- Relay logs show timeout errors |
|
||||||
|
|
||||||
2. **Event delivery failures:** |
|
||||||
- Events are published but not received |
|
||||||
- Relay logs show "delivery TIMEOUT" messages |
|
||||||
- Subscription is removed from publisher |
|
||||||
|
|
||||||
3. **Resource leaks:** |
|
||||||
- Memory usage grows over time |
|
||||||
- Goroutine count increases continuously |
|
||||||
- Connection not cleaned up properly |
|
||||||
|
|
||||||
## Test Scenarios |
|
||||||
|
|
||||||
### 1. Basic Long-Running Test |
|
||||||
|
|
||||||
**Duration:** 60 seconds |
|
||||||
**Event Rate:** 1 event every 2-5 seconds |
|
||||||
**Expected:** All events received, subscription stays active |
|
||||||
|
|
||||||
```bash |
|
||||||
./subscription-test -url ws://localhost:3334 -duration 60 |
|
||||||
``` |
|
||||||
|
|
||||||
### 2. Extended Duration Test |
|
||||||
|
|
||||||
**Duration:** 300 seconds (5 minutes) |
|
||||||
**Event Rate:** 1 event every 10 seconds |
|
||||||
**Expected:** All events received throughout 5 minutes |
|
||||||
|
|
||||||
```bash |
|
||||||
./subscription-test -url ws://localhost:3334 -duration 300 |
|
||||||
``` |
|
||||||
|
|
||||||
### 3. Multiple Subscriptions |
|
||||||
|
|
||||||
Run multiple test instances simultaneously: |
|
||||||
|
|
||||||
```bash |
|
||||||
# Terminal 1 |
|
||||||
./subscription-test -url ws://localhost:3334 -duration 120 -kind 1 -sub sub1 |
|
||||||
|
|
||||||
# Terminal 2 |
|
||||||
./subscription-test -url ws://localhost:3334 -duration 120 -kind 1 -sub sub2 |
|
||||||
|
|
||||||
# Terminal 3 |
|
||||||
./subscription-test -url ws://localhost:3334 -duration 120 -kind 1 -sub sub3 |
|
||||||
``` |
|
||||||
|
|
||||||
**Expected:** All subscriptions receive events independently |
|
||||||
|
|
||||||
### 4. Idle Subscription Test |
|
||||||
|
|
||||||
**Duration:** 120 seconds |
|
||||||
**Event Rate:** Publish events only at start and end |
|
||||||
**Expected:** Subscription remains active even during long idle period |
|
||||||
|
|
||||||
```bash |
|
||||||
# Start test |
|
||||||
./subscription-test -url ws://localhost:3334 -duration 120 |
|
||||||
|
|
||||||
# Publish 1-2 events immediately |
|
||||||
# Wait 100 seconds (subscription should stay alive) |
|
||||||
# Publish 1-2 more events |
|
||||||
# Verify test receives the late events |
|
||||||
``` |
|
||||||
|
|
||||||
## Debugging |
|
||||||
|
|
||||||
### Enable Verbose Logging |
|
||||||
|
|
||||||
```bash |
|
||||||
# Relay |
|
||||||
export ORLY_LOG_LEVEL=debug |
|
||||||
./orly |
|
||||||
|
|
||||||
# Test tool |
|
||||||
./subscription-test -url ws://localhost:3334 -duration 60 -v |
|
||||||
``` |
|
||||||
|
|
||||||
### Check Relay Logs |
|
||||||
|
|
||||||
Look for these log patterns: |
|
||||||
|
|
||||||
**Good (working subscription):** |
|
||||||
``` |
|
||||||
subscription test-123456 created and goroutine launched for 127.0.0.1 |
|
||||||
delivered real-time event abc123... to subscription test-123456 @ 127.0.0.1 |
|
||||||
subscription delivery QUEUED: event=abc123... to=127.0.0.1 |
|
||||||
``` |
|
||||||
|
|
||||||
**Bad (subscription issues):** |
|
||||||
``` |
|
||||||
subscription delivery TIMEOUT: event=abc123... |
|
||||||
removing failed subscriber connection |
|
||||||
subscription goroutine exiting unexpectedly |
|
||||||
``` |
|
||||||
|
|
||||||
### Monitor Resource Usage

```bash
# Watch memory usage
watch -n 1 'ps aux | grep orly'

# Check goroutine count (requires pprof enabled)
curl http://localhost:6060/debug/pprof/goroutine?debug=1
```

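The goroutine check above only works if a pprof endpoint is listening on port 6060. If your build does not already expose one (the relay appears to have a pprof-related config option, but it is not documented here), a standard, relay-agnostic way to add pprof to a Go program is the `net/http/pprof` side-effect import, sketched below with an assumed port of 6060.

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on DefaultServeMux
)

func main() {
	// Serve pprof on localhost:6060 in the background; the rest of the
	// program continues as normal.
	go func() {
		if err := http.ListenAndServe("localhost:6060", nil); err != nil {
			log.Printf("pprof server stopped: %v", err)
		}
	}()

	select {} // stand-in for the application's real work
}
```
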
## Expected Performance |
|
||||||
|
|
||||||
With the fixes applied: |
|
||||||
|
|
||||||
- **Subscription lifetime:** Unlimited (hours/days) |
|
||||||
- **Event delivery latency:** < 100ms |
|
||||||
- **Max concurrent subscriptions:** Thousands per relay |
|
||||||
- **Memory per subscription:** ~10KB (goroutine + buffers) |
|
||||||
- **CPU overhead:** Minimal (event-driven) |
|
||||||
|
|
||||||
## Automated Tests |
|
||||||
|
|
||||||
Run the Go test suite: |
|
||||||
|
|
||||||
```bash |
|
||||||
# Run all tests |
|
||||||
./scripts/test.sh |
|
||||||
|
|
||||||
# Run subscription tests only (once implemented) |
|
||||||
go test -v -run TestLongRunningSubscription ./app |
|
||||||
go test -v -run TestMultipleConcurrentSubscriptions ./app |
|
||||||
``` |
|
||||||
|
|
||||||
## Common Issues |
|
||||||
|
|
||||||
### Issue: "Failed to connect" |
|
||||||
|
|
||||||
**Cause:** Relay not running or wrong URL |
|
||||||
**Solution:** |
|
||||||
```bash |
|
||||||
# Check relay is running |
|
||||||
ps aux | grep orly |
|
||||||
|
|
||||||
# Verify port |
|
||||||
netstat -tlnp | grep 3334 |
|
||||||
``` |
|
||||||
|
|
||||||
### Issue: "No events received" |
|
||||||
|
|
||||||
**Cause:** No events being published |
|
||||||
**Solution:** Publish test events while test is running (see section 3 above) |
|
||||||
|
|
||||||
### Issue: "Subscription CLOSED by relay" |
|
||||||
|
|
||||||
**Cause:** Filter policy or ACL rejecting subscription |
|
||||||
**Solution:** Check relay configuration and ACL settings |
|
||||||
|
|
||||||
### Issue: Test hangs at EOSE |
|
||||||
|
|
||||||
**Cause:** Relay not sending EOSE |
|
||||||
**Solution:** Check relay logs for query errors |
|
||||||
|
|
||||||
## Manual Testing with Raw WebSocket |
|
||||||
|
|
||||||
If you prefer manual testing, you can use any WebSocket client: |
|
||||||
|
|
||||||
```bash |
|
||||||
# Install wscat (Node.js based, no glibc issues) |
|
||||||
npm install -g wscat |
|
||||||
|
|
||||||
# Connect and subscribe |
|
||||||
wscat -c ws://localhost:3334 |
|
||||||
> ["REQ","manual-test",{"kinds":[1]}] |
|
||||||
|
|
||||||
# Wait for EOSE |
|
||||||
< ["EOSE","manual-test"] |
|
||||||
|
|
||||||
# Events should arrive as they're published |
|
||||||
< ["EVENT","manual-test",{"id":"...","kind":1,...}] |
|
||||||
``` |
|
||||||
|
|
||||||
## Comparison: Before vs After Fixes |
|
||||||
|
|
||||||
### Before (Broken) |
|
||||||
|
|
||||||
``` |
|
||||||
$ ./subscription-test -duration 60 |
|
||||||
✓ Connected |
|
||||||
✓ Received EOSE |
|
||||||
[EVENT #1] id=abc123... kind=1 |
|
||||||
[EVENT #2] id=def456... kind=1 |
|
||||||
... |
|
||||||
[EVENT #30] id=xyz789... kind=1 |
|
||||||
⚠ Warning: No events received for 35s - subscription may have dropped |
|
||||||
Test complete: 30 events received (expected 60) |
|
||||||
``` |
|
||||||
|
|
||||||
### After (Fixed) |
|
||||||
|
|
||||||
``` |
|
||||||
$ ./subscription-test -duration 60 |
|
||||||
✓ Connected |
|
||||||
✓ Received EOSE |
|
||||||
[EVENT #1] id=abc123... kind=1 |
|
||||||
[EVENT #2] id=def456... kind=1 |
|
||||||
... |
|
||||||
[EVENT #60] id=xyz789... kind=1 |
|
||||||
✓ TEST PASSED - Subscription remained stable |
|
||||||
Test complete: 60 events received |
|
||||||
``` |
|
||||||
|
|
||||||
## Reporting Issues |
|
||||||
|
|
||||||
If subscriptions still drop after the fixes, please report with: |
|
||||||
|
|
||||||
1. Relay logs (with `ORLY_LOG_LEVEL=debug`) |
|
||||||
2. Test output |
|
||||||
3. Steps to reproduce |
|
||||||
4. Relay configuration |
|
||||||
5. Event publishing method |
|
||||||
|
|
||||||
## Summary |
|
||||||
|
|
||||||
The subscription stability fixes ensure: |
|
||||||
|
|
||||||
✅ Subscriptions remain active indefinitely |
|
||||||
✅ All events are delivered without timeouts |
|
||||||
✅ Clean resource management (no leaks) |
|
||||||
✅ Multiple concurrent subscriptions work correctly |
|
||||||
✅ Idle subscriptions don't timeout |
|
||||||
|
|
||||||
Follow the test scenarios above to verify these improvements in your deployment. |
|
||||||
@@ -1,108 +0,0 @@
# Test Subscription Stability NOW |
|
||||||
|
|
||||||
## Quick Test (No Events Required) |
|
||||||
|
|
||||||
This test verifies the subscription stays registered without needing to publish events: |
|
||||||
|
|
||||||
```bash |
|
||||||
# Terminal 1: Start relay |
|
||||||
./orly |
|
||||||
|
|
||||||
# Terminal 2: Run simple test |
|
||||||
./subscription-test-simple -url ws://localhost:3334 -duration 120 |
|
||||||
``` |
|
||||||
|
|
||||||
**Expected output:** |
|
||||||
``` |
|
||||||
✓ Connected |
|
||||||
✓ Received EOSE - subscription is active |
|
||||||
|
|
||||||
Subscription is active. Monitoring for 120 seconds... |
|
||||||
|
|
||||||
[ 10s/120s] Messages: 1 | Last message: 5s ago | Status: ACTIVE (recent message) |
|
||||||
[ 20s/120s] Messages: 1 | Last message: 15s ago | Status: IDLE (normal) |
|
||||||
[ 30s/120s] Messages: 1 | Last message: 25s ago | Status: IDLE (normal) |
|
||||||
... |
|
||||||
[120s/120s] Messages: 1 | Last message: 115s ago | Status: QUIET (possibly normal) |
|
||||||
|
|
||||||
✓ TEST PASSED |
|
||||||
Subscription remained active throughout test period. |
|
||||||
``` |
|
||||||
|
|
||||||
## Full Test (With Events) |
|
||||||
|
|
||||||
For comprehensive testing with event delivery: |
|
||||||
|
|
||||||
```bash |
|
||||||
# Terminal 1: Start relay |
|
||||||
./orly |
|
||||||
|
|
||||||
# Terminal 2: Run test |
|
||||||
./subscription-test -url ws://localhost:3334 -duration 60 |
|
||||||
|
|
||||||
# Terminal 3: Publish test events |
|
||||||
# Use your preferred method to publish events to the relay |
|
||||||
# The test will show events being received |
|
||||||
``` |
|
||||||
|
|
||||||
## What the Fixes Do |
|
||||||
|
|
||||||
### Before (Broken) |
|
||||||
- Subscriptions dropped after ~30-60 seconds |
|
||||||
- Receiver channels filled up (32 event buffer) |
|
||||||
- Publisher timed out trying to send |
|
||||||
- Events stopped being delivered |
|
||||||
|
|
||||||
### After (Fixed) |
|
||||||
- Subscriptions stay active indefinitely |
|
||||||
- Per-subscription consumer goroutines |
|
||||||
- Channels never fill up |
|
||||||
- All events delivered without timeouts |
|
||||||
|
|
||||||
## Troubleshooting |
|
||||||
|
|
||||||
### "Failed to connect" |
|
||||||
```bash |
|
||||||
# Check relay is running |
|
||||||
ps aux | grep orly |
|
||||||
|
|
||||||
# Check port |
|
||||||
netstat -tlnp | grep 3334 |
|
||||||
``` |
|
||||||
|
|
||||||
### "Did not receive EOSE" |
|
||||||
```bash |
|
||||||
# Enable debug logging |
|
||||||
export ORLY_LOG_LEVEL=debug |
|
||||||
./orly |
|
||||||
``` |
|
||||||
|
|
||||||
### Test panics |
|
||||||
Already fixed! The latest version includes proper error handling. |
|
||||||
|
|
||||||
## Files Changed

Core fixes in these files:
- `app/listener.go` - Subscription tracking + **concurrent message processing**
- `app/handle-req.go` - Consumer goroutines (THE KEY FIX)
- `app/handle-close.go` - Proper cleanup
- `app/handle-websocket.go` - Cancel all on disconnect

**Latest fix:** The message processor now handles incoming messages concurrently, which prevents the read queue from filling up. A generic sketch of that dispatch pattern follows.

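The exact change in `app/listener.go` is not shown in these notes; the sketch below illustrates the general pattern described, dispatching each queued message to its own goroutine so a slow handler cannot stall the read loop. All names here are illustrative.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// processMessages drains the per-connection message queue and handles each
// message in its own goroutine, so one slow handler (for example a heavy REQ
// query) cannot block subsequent messages from being processed.
func processMessages(ctx context.Context, queue <-chan []byte, handle func([]byte)) {
	for {
		select {
		case <-ctx.Done():
			return
		case msg, ok := <-queue:
			if !ok {
				return
			}
			go handle(msg) // concurrent dispatch is the "latest fix" described above
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()

	queue := make(chan []byte, 8)
	go processMessages(ctx, queue, func(msg []byte) {
		fmt.Println("handled:", string(msg))
	})

	queue <- []byte(`["REQ","test",{"kinds":[1]}]`)
	queue <- []byte(`["CLOSE","test"]`)
	time.Sleep(100 * time.Millisecond)
}
```

A real relay would still funnel all socket writes through the single write worker and may need ordering guarantees for some message types; the sketch only shows the dispatch shape.
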
## Build Status |
|
||||||
|
|
||||||
✅ All code builds successfully: |
|
||||||
```bash |
|
||||||
go build -o orly # Relay |
|
||||||
go build -o subscription-test ./cmd/subscription-test # Full test |
|
||||||
go build -o subscription-test-simple ./cmd/subscription-test-simple # Simple test |
|
||||||
``` |
|
||||||
|
|
||||||
## Quick Summary |
|
||||||
|
|
||||||
**Problem:** Receiver channels created but never consumed → filled up → timeout → subscription dropped |
|
||||||
|
|
||||||
**Solution:** Per-subscription consumer goroutines (khatru pattern) that continuously read from channels and forward events to clients |
|
||||||
|
|
||||||
**Result:** Subscriptions now stable for unlimited duration ✅ |
|
||||||
@@ -1,273 +0,0 @@
package main |
|
||||||
|
|
||||||
import ( |
|
||||||
"encoding/json" |
|
||||||
"fmt" |
|
||||||
"net" |
|
||||||
"os" |
|
||||||
"path/filepath" |
|
||||||
"strings" |
|
||||||
"testing" |
|
||||||
"time" |
|
||||||
|
|
||||||
lol "lol.mleku.dev" |
|
||||||
"next.orly.dev/app/config" |
|
||||||
"next.orly.dev/pkg/encoders/event" |
|
||||||
"next.orly.dev/pkg/encoders/tag" |
|
||||||
"next.orly.dev/pkg/interfaces/signer/p8k" |
|
||||||
"next.orly.dev/pkg/policy" |
|
||||||
"next.orly.dev/pkg/run" |
|
||||||
relaytester "next.orly.dev/relay-tester" |
|
||||||
) |
|
||||||
|
|
||||||
// TestClusterPeerPolicyFiltering tests cluster peer synchronization with policy filtering.
|
|
||||||
// This test:
|
|
||||||
// 1. Starts multiple relays using the test relay launch functionality
|
|
||||||
// 2. Configures them as peers to each other (though sync managers are not fully implemented in this test)
|
|
||||||
// 3. Tests policy filtering with a kind whitelist that allows only specific event kinds
|
|
||||||
// 4. Verifies that the policy correctly allows/denies events based on the whitelist
|
|
||||||
//
|
|
||||||
// Note: This test focuses on the policy filtering aspect of cluster peers.
|
|
||||||
// Full cluster synchronization testing would require implementing the sync manager
|
|
||||||
// integration, which is beyond the scope of this initial test.
|
|
||||||
func TestClusterPeerPolicyFiltering(t *testing.T) { |
|
||||||
if testing.Short() { |
|
||||||
t.Skip("skipping cluster peer integration test") |
|
||||||
} |
|
||||||
|
|
||||||
// Number of relays in the cluster
|
|
||||||
numRelays := 3 |
|
||||||
|
|
||||||
// Start multiple test relays
|
|
||||||
relays, ports, err := startTestRelays(numRelays) |
|
||||||
if err != nil { |
|
||||||
t.Fatalf("Failed to start test relays: %v", err) |
|
||||||
} |
|
||||||
defer func() { |
|
||||||
for _, relay := range relays { |
|
||||||
if tr, ok := relay.(*testRelay); ok { |
|
||||||
if stopErr := tr.Stop(); stopErr != nil { |
|
||||||
t.Logf("Error stopping relay: %v", stopErr) |
|
||||||
} |
|
||||||
} |
|
||||||
} |
|
||||||
}() |
|
||||||
|
|
||||||
// Create relay URLs
|
|
||||||
relayURLs := make([]string, numRelays) |
|
||||||
for i, port := range ports { |
|
||||||
relayURLs[i] = fmt.Sprintf("http://127.0.0.1:%d", port) |
|
||||||
} |
|
||||||
|
|
||||||
// Wait for all relays to be ready
|
|
||||||
for _, url := range relayURLs { |
|
||||||
wsURL := strings.Replace(url, "http://", "ws://", 1) // Convert http to ws
|
|
||||||
if err := waitForTestRelay(wsURL, 10*time.Second); err != nil { |
|
||||||
t.Fatalf("Relay not ready after timeout: %s, %v", wsURL, err) |
|
||||||
} |
|
||||||
t.Logf("Relay is ready at %s", wsURL) |
|
||||||
} |
|
||||||
|
|
||||||
// Create policy configuration with small kind whitelist
|
|
||||||
policyJSON := map[string]interface{}{ |
|
||||||
"kind": map[string]interface{}{ |
|
||||||
"whitelist": []int{1, 7, 42}, // Allow only text notes, user statuses, and channel messages
|
|
||||||
}, |
|
||||||
"default_policy": "allow", // Allow everything not explicitly denied
|
|
||||||
} |
|
||||||
|
|
||||||
policyJSONBytes, err := json.MarshalIndent(policyJSON, "", " ") |
|
||||||
if err != nil { |
|
||||||
t.Fatalf("Failed to marshal policy JSON: %v", err) |
|
||||||
} |
|
||||||
|
|
||||||
// Create temporary directory for policy config
|
|
||||||
tempDir := t.TempDir() |
|
||||||
configDir := filepath.Join(tempDir, "ORLY_POLICY") |
|
||||||
if err := os.MkdirAll(configDir, 0755); err != nil { |
|
||||||
t.Fatalf("Failed to create config directory: %v", err) |
|
||||||
} |
|
||||||
|
|
||||||
policyPath := filepath.Join(configDir, "policy.json") |
|
||||||
if err := os.WriteFile(policyPath, policyJSONBytes, 0644); err != nil { |
|
||||||
t.Fatalf("Failed to write policy file: %v", err) |
|
||||||
} |
|
||||||
|
|
||||||
// Create policy from JSON directly for testing
|
|
||||||
testPolicy, err := policy.New(policyJSONBytes) |
|
||||||
if err != nil { |
|
||||||
t.Fatalf("Failed to create policy: %v", err) |
|
||||||
} |
|
||||||
|
|
||||||
// Generate test keys
|
|
||||||
signer := p8k.MustNew() |
|
||||||
if err := signer.Generate(); err != nil { |
|
||||||
t.Fatalf("Failed to generate test signer: %v", err) |
|
||||||
} |
|
||||||
|
|
||||||
// Create test events of different kinds
|
|
||||||
testEvents := []*event.E{ |
|
||||||
// Kind 1 (text note) - should be allowed by policy
|
|
||||||
createTestEvent(t, signer, "Text note - should sync", 1), |
|
||||||
// Kind 7 (user status) - should be allowed by policy
|
|
||||||
createTestEvent(t, signer, "User status - should sync", 7), |
|
||||||
// Kind 42 (channel message) - should be allowed by policy
|
|
||||||
createTestEvent(t, signer, "Channel message - should sync", 42), |
|
||||||
// Kind 0 (metadata) - should be denied by policy
|
|
||||||
createTestEvent(t, signer, "Metadata - should NOT sync", 0), |
|
||||||
// Kind 3 (follows) - should be denied by policy
|
|
||||||
createTestEvent(t, signer, "Follows - should NOT sync", 3), |
|
||||||
} |
|
||||||
|
|
||||||
t.Logf("Created %d test events", len(testEvents)) |
|
||||||
|
|
||||||
    // Publish events to the first relay (non-policy relay)
    firstRelayWS := fmt.Sprintf("ws://127.0.0.1:%d", ports[0])
    client, err := relaytester.NewClient(firstRelayWS)
    if err != nil {
        t.Fatalf("Failed to connect to first relay: %v", err)
    }
    defer client.Close()

    // Publish all events to the first relay
    for i, ev := range testEvents {
        if err := client.Publish(ev); err != nil {
            t.Fatalf("Failed to publish event %d: %v", i, err)
        }

        // Wait for OK response
        accepted, reason, err := client.WaitForOK(ev.ID, 5*time.Second)
        if err != nil {
            t.Fatalf("Failed to get OK response for event %d: %v", i, err)
        }
        if !accepted {
            t.Logf("Event %d rejected: %s (kind: %d)", i, reason, ev.Kind)
        } else {
            t.Logf("Event %d accepted (kind: %d)", i, ev.Kind)
        }
    }

    // Test policy filtering directly
    t.Logf("Testing policy filtering...")

    // Test that the policy correctly allows/denies events based on the whitelist
    // Only kinds 1, 7, and 42 should be allowed
    for i, ev := range testEvents {
        allowed, err := testPolicy.CheckPolicy("write", ev, signer.Pub(), "127.0.0.1")
        if err != nil {
            t.Fatalf("Policy check failed for event %d: %v", i, err)
        }

        expectedAllowed := ev.Kind == 1 || ev.Kind == 7 || ev.Kind == 42
        if allowed != expectedAllowed {
            t.Errorf("Event %d (kind %d): expected allowed=%v, got %v", i, ev.Kind, expectedAllowed, allowed)
        }
    }

    t.Logf("Policy filtering test completed successfully")

    // Note: In a real cluster setup, the sync manager would use this policy
    // to filter events during synchronization between peers. This test demonstrates
    // that the policy correctly identifies which events should be allowed to sync.
}

// testRelay wraps a run.Relay for testing purposes
type testRelay struct {
    *run.Relay
}

// startTestRelays starts multiple test relays with different configurations
func startTestRelays(count int) ([]interface{}, []int, error) {
    relays := make([]interface{}, count)
    ports := make([]int, count)

    for i := 0; i < count; i++ {
        cfg := &config.C{
            AppName: fmt.Sprintf("ORLY-TEST-%d", i),
            DataDir: "", // Use temp dir
            Listen: "127.0.0.1",
            Port: 0, // Random port
            HealthPort: 0,
            EnableShutdown: false,
            LogLevel: "warn",
            DBLogLevel: "warn",
            DBBlockCacheMB: 512,
            DBIndexCacheMB: 256,
            LogToStdout: false,
            PprofHTTP: false,
            ACLMode: "none",
            AuthRequired: false,
            AuthToWrite: false,
            SubscriptionEnabled: false,
            MonthlyPriceSats: 6000,
            FollowListFrequency: time.Hour,
            WebDisableEmbedded: false,
            SprocketEnabled: false,
            SpiderMode: "none",
            PolicyEnabled: false, // We'll enable it separately for one relay
        }

        // Find available port
        listener, err := net.Listen("tcp", "127.0.0.1:0")
        if err != nil {
            return nil, nil, fmt.Errorf("failed to find available port for relay %d: %w", i, err)
        }
        addr := listener.Addr().(*net.TCPAddr)
        cfg.Port = addr.Port
        listener.Close()

        // Set up logging
        lol.SetLogLevel(cfg.LogLevel)

        opts := &run.Options{
            CleanupDataDir: func(b bool) *bool { return &b }(true),
        }

        relay, err := run.Start(cfg, opts)
        if err != nil {
            return nil, nil, fmt.Errorf("failed to start relay %d: %w", i, err)
        }

        relays[i] = &testRelay{Relay: relay}
        ports[i] = cfg.Port
    }

    return relays, ports, nil
}

// waitForTestRelay waits for a relay to be ready by attempting to connect
func waitForTestRelay(url string, timeout time.Duration) error {
    // Extract host:port from ws:// URL
    addr := url
    if len(url) > 5 && url[:5] == "ws://" {
        addr = url[5:]
    }
    deadline := time.Now().Add(timeout)
    attempts := 0
    for time.Now().Before(deadline) {
        conn, err := net.DialTimeout("tcp", addr, 500*time.Millisecond)
        if err == nil {
            conn.Close()
            return nil
        }
        attempts++
        time.Sleep(100 * time.Millisecond)
    }
    return fmt.Errorf("timeout waiting for relay at %s after %d attempts", url, attempts)
}

// createTestEvent creates a test event with proper signing
func createTestEvent(t *testing.T, signer *p8k.Signer, content string, eventKind uint16) *event.E {
    ev := event.New()
    ev.CreatedAt = time.Now().Unix()
    ev.Kind = eventKind
    ev.Content = []byte(content)
    ev.Tags = tag.NewS()

    // Sign the event
    if err := ev.Sign(signer); err != nil {
        t.Fatalf("Failed to sign test event: %v", err)
    }

    return ev
}
@@ -1,245 +0,0 @@
package main

import (
    "fmt"
    "net"
    "os"
    "path/filepath"
    "testing"
    "time"

    lol "lol.mleku.dev"
    "next.orly.dev/app/config"
    "next.orly.dev/pkg/run"
    relaytester "next.orly.dev/relay-tester"
)

var (
    testRelayURL string
    testName string
    testJSON bool
    keepDataDir bool
    relayPort int
    relayDataDir string
)

func TestRelay(t *testing.T) {
    var err error
    var relay *run.Relay
    var relayURL string

    // Determine relay URL
    if testRelayURL != "" {
        relayURL = testRelayURL
    } else {
        // Start local relay for testing
        var port int
        if relay, port, err = startTestRelay(); err != nil {
            t.Fatalf("Failed to start test relay: %v", err)
        }
        defer func() {
            if stopErr := relay.Stop(); stopErr != nil {
                t.Logf("Error stopping relay: %v", stopErr)
            }
        }()
        relayURL = fmt.Sprintf("ws://127.0.0.1:%d", port)
        t.Logf("Waiting for relay to be ready at %s...", relayURL)
        // Wait for relay to be ready - try connecting to verify it's up
        if err = waitForRelay(relayURL, 10*time.Second); err != nil {
            t.Fatalf("Relay not ready after timeout: %v", err)
        }
        t.Logf("Relay is ready at %s", relayURL)
    }

    // Create test suite
    t.Logf("Creating test suite for %s...", relayURL)
    suite, err := relaytester.NewTestSuite(relayURL)
    if err != nil {
        t.Fatalf("Failed to create test suite: %v", err)
    }
    t.Logf("Test suite created, running tests...")

    // Run tests
    var results []relaytester.TestResult
    if testName != "" {
        // Run specific test
        result, err := suite.RunTest(testName)
        if err != nil {
            t.Fatalf("Failed to run test %s: %v", testName, err)
        }
        results = []relaytester.TestResult{result}
    } else {
        // Run all tests
        if results, err = suite.Run(); err != nil {
            t.Fatalf("Failed to run tests: %v", err)
        }
    }

    // Output results
    if testJSON {
        jsonOutput, err := relaytester.FormatJSON(results)
        if err != nil {
            t.Fatalf("Failed to format JSON: %v", err)
        }
        fmt.Println(jsonOutput)
    } else {
        outputResults(results, t)
    }

    // Check if any required tests failed
    for _, result := range results {
        if result.Required && !result.Pass {
            t.Errorf("Required test '%s' failed: %s", result.Name, result.Info)
        }
    }
}

func startTestRelay() (relay *run.Relay, port int, err error) {
    cfg := &config.C{
        AppName: "ORLY-TEST",
        DataDir: relayDataDir,
        Listen: "127.0.0.1",
        Port: 0, // Always use random port, unless overridden via -port flag
        HealthPort: 0,
        EnableShutdown: false,
        LogLevel: "warn",
        DBLogLevel: "warn",
        DBBlockCacheMB: 512,
        DBIndexCacheMB: 256,
        LogToStdout: false,
        PprofHTTP: false,
        ACLMode: "none",
        AuthRequired: false,
        AuthToWrite: false,
        SubscriptionEnabled: false,
        MonthlyPriceSats: 6000,
        FollowListFrequency: time.Hour,
        WebDisableEmbedded: false,
        SprocketEnabled: false,
        SpiderMode: "none",
        PolicyEnabled: false,
    }

    // Use explicitly set port if provided via flag, otherwise find an available port
    if relayPort > 0 {
        cfg.Port = relayPort
    } else {
        var listener net.Listener
        if listener, err = net.Listen("tcp", "127.0.0.1:0"); err != nil {
            return nil, 0, fmt.Errorf("failed to find available port: %w", err)
        }
        addr := listener.Addr().(*net.TCPAddr)
        cfg.Port = addr.Port
        listener.Close()
    }

    // Set default data dir if not specified
    if cfg.DataDir == "" {
        tmpDir := filepath.Join(os.TempDir(), fmt.Sprintf("orly-test-%d", time.Now().UnixNano()))
        cfg.DataDir = tmpDir
    }

    // Set up logging
    lol.SetLogLevel(cfg.LogLevel)

    // Create options
    cleanup := !keepDataDir
    opts := &run.Options{
        CleanupDataDir: &cleanup,
    }

    // Start relay
    if relay, err = run.Start(cfg, opts); err != nil {
        return nil, 0, fmt.Errorf("failed to start relay: %w", err)
    }

    return relay, cfg.Port, nil
}

// waitForRelay waits for the relay to be ready by attempting to connect
func waitForRelay(url string, timeout time.Duration) error {
    // Extract host:port from ws:// URL
    addr := url
    if len(url) > 7 && url[:5] == "ws://" {
        addr = url[5:]
    }
    deadline := time.Now().Add(timeout)
    attempts := 0
    for time.Now().Before(deadline) {
        conn, err := net.DialTimeout("tcp", addr, 500*time.Millisecond)
        if err == nil {
            conn.Close()
            return nil
        }
        attempts++
        if attempts%10 == 0 {
            // Log every 10th attempt (every second)
        }
        time.Sleep(100 * time.Millisecond)
    }
    return fmt.Errorf("timeout waiting for relay at %s after %d attempts", url, attempts)
}

func outputResults(results []relaytester.TestResult, t *testing.T) {
    passed := 0
    failed := 0
    requiredFailed := 0

    for _, result := range results {
        if result.Pass {
            passed++
            t.Logf("PASS: %s", result.Name)
        } else {
            failed++
            if result.Required {
                requiredFailed++
                t.Errorf("FAIL (required): %s - %s", result.Name, result.Info)
            } else {
                t.Logf("FAIL (optional): %s - %s", result.Name, result.Info)
            }
        }
    }

    t.Logf("\nTest Summary:")
    t.Logf(" Total: %d", len(results))
    t.Logf(" Passed: %d", passed)
    t.Logf(" Failed: %d", failed)
    t.Logf(" Required Failed: %d", requiredFailed)
}

// TestMain allows custom test setup/teardown
func TestMain(m *testing.M) {
    // Manually parse our custom flags to avoid conflicts with Go's test flags
    for i := 1; i < len(os.Args); i++ {
        arg := os.Args[i]
        switch arg {
        case "-relay-url":
            if i+1 < len(os.Args) {
                testRelayURL = os.Args[i+1]
                i++
            }
        case "-test-name":
            if i+1 < len(os.Args) {
                testName = os.Args[i+1]
                i++
            }
        case "-json":
            testJSON = true
        case "-keep-data":
            keepDataDir = true
        case "-port":
            if i+1 < len(os.Args) {
                fmt.Sscanf(os.Args[i+1], "%d", &relayPort)
                i++
            }
        case "-data-dir":
            if i+1 < len(os.Args) {
                relayDataDir = os.Args[i+1]
                i++
            }
        }
    }

    code := m.Run()
    os.Exit(code)
}
@@ -1,167 +0,0 @@
#!/usr/bin/env node

// Test script to verify websocket connections are not closed prematurely
// This is a Node.js test script that can be run with: node test-relay-connection.js

import { NostrWebSocket } from '@nostr-dev-kit/ndk';

const RELAY = process.env.RELAY || 'ws://localhost:8080';
const MAX_CONNECTIONS = 10;
const TEST_DURATION = 30000; // 30 seconds

let connectionsClosed = 0;
let connectionsOpened = 0;
let messagesReceived = 0;
let errors = 0;

const stats = {
  premature: 0,
  normal: 0,
  errors: 0,
};

class TestConnection {
  constructor(id) {
    this.id = id;
    this.ws = null;
    this.closed = false;
    this.openTime = null;
    this.closeTime = null;
    this.lastError = null;
  }

  connect() {
    return new Promise((resolve, reject) => {
      this.ws = new NostrWebSocket(RELAY);

      this.ws.addEventListener('open', () => {
        this.openTime = Date.now();
        connectionsOpened++;
        console.log(`[Connection ${this.id}] Opened`);
        resolve();
      });

      this.ws.addEventListener('close', (event) => {
        this.closeTime = Date.now();
        this.closed = true;
        connectionsClosed++;
        const duration = this.closeTime - this.openTime;
        console.log(`[Connection ${this.id}] Closed: code=${event.code}, reason="${event.reason || ''}", duration=${duration}ms`);

        if (duration < 5000 && event.code !== 1000) {
          stats.premature++;
          console.log(`[Connection ${this.id}] PREMATURE CLOSE DETECTED: duration=${duration}ms < 5s`);
        } else {
          stats.normal++;
        }
      });

      this.ws.addEventListener('error', (error) => {
        this.lastError = error;
        stats.errors++;
        console.error(`[Connection ${this.id}] Error:`, error);
      });

      this.ws.addEventListener('message', (event) => {
        messagesReceived++;
        try {
          const data = JSON.parse(event.data);
          console.log(`[Connection ${this.id}] Message:`, data[0]);
        } catch (e) {
          console.log(`[Connection ${this.id}] Message (non-JSON):`, event.data);
        }
      });

      setTimeout(reject, 5000); // Timeout after 5 seconds if not opened
    });
  }

  sendReq() {
    if (this.ws && !this.closed) {
      this.ws.send(JSON.stringify(['REQ', `test-sub-${this.id}`, { kinds: [1], limit: 10 }]));
      console.log(`[Connection ${this.id}] Sent REQ`);
    }
  }

  close() {
    if (this.ws && !this.closed) {
      this.ws.close();
    }
  }
}

async function runTest() {
  console.log('='.repeat(60));
  console.log('Testing Relay Connection Stability');
  console.log('='.repeat(60));
  console.log(`Relay: ${RELAY}`);
  console.log(`Duration: ${TEST_DURATION}ms`);
  console.log(`Connections: ${MAX_CONNECTIONS}`);
  console.log('='.repeat(60));
  console.log();

  const connections = [];

  // Open connections
  console.log('Opening connections...');
  for (let i = 0; i < MAX_CONNECTIONS; i++) {
    const conn = new TestConnection(i);
    try {
      await conn.connect();
      connections.push(conn);
    } catch (error) {
      console.error(`Failed to open connection ${i}:`, error);
    }
  }

  console.log(`Opened ${connections.length} connections`);
  console.log();

  // Send requests from each connection
  console.log('Sending REQ messages...');
  for (const conn of connections) {
    conn.sendReq();
  }

  // Wait and let connections run
  console.log(`Waiting ${TEST_DURATION / 1000}s...`);
  await new Promise(resolve => setTimeout(resolve, TEST_DURATION));

  // Close all connections
  console.log('Closing all connections...');
  for (const conn of connections) {
    conn.close();
  }

  // Wait for close events
  await new Promise(resolve => setTimeout(resolve, 1000));

  // Print results
  console.log();
  console.log('='.repeat(60));
  console.log('Test Results:');
  console.log('='.repeat(60));
  console.log(`Connections Opened: ${connectionsOpened}`);
  console.log(`Connections Closed: ${connectionsClosed}`);
  console.log(`Messages Received: ${messagesReceived}`);
  console.log();
  console.log('Closure Analysis:');
  console.log(`- Premature Closes: ${stats.premature}`);
  console.log(`- Normal Closes: ${stats.normal}`);
  console.log(`- Errors: ${stats.errors}`);
  console.log('='.repeat(60));

  if (stats.premature > 0) {
    console.error('FAILED: Detected premature connection closures!');
    process.exit(1);
  } else {
    console.log('PASSED: No premature connection closures detected.');
    process.exit(0);
  }
}

runTest().catch(error => {
  console.error('Test failed:', error);
  process.exit(1);
});
@@ -1,57 +0,0 @@
import { NostrWebSocket } from '@nostr-dev-kit/ndk';

const RELAY = process.env.RELAY || 'ws://localhost:8080';

async function testConnectionClosure() {
  console.log('Testing websocket connection closure issues...');
  console.log('Connecting to:', RELAY);

  // Create multiple connections to test concurrency
  const connections = [];
  const results = { connected: 0, closed: 0, errors: 0 };

  for (let i = 0; i < 5; i++) {
    const ws = new NostrWebSocket(RELAY);

    ws.addEventListener('open', () => {
      console.log(`Connection ${i} opened`);
      results.connected++;
    });

    ws.addEventListener('close', (event) => {
      console.log(`Connection ${i} closed:`, event.code, event.reason);
      results.closed++;
    });

    ws.addEventListener('error', (error) => {
      console.error(`Connection ${i} error:`, error);
      results.errors++;
    });

    connections.push(ws);
  }

  // Wait a bit then send REQs
  await new Promise(resolve => setTimeout(resolve, 1000));

  // Send some REQ messages
  for (const ws of connections) {
    ws.send(JSON.stringify(['REQ', 'test-sub', { kinds: [1] }]));
  }

  // Wait and observe behavior
  await new Promise(resolve => setTimeout(resolve, 5000));

  console.log('\nTest Results:');
  console.log(`- Connected: ${results.connected}`);
  console.log(`- Closed prematurely: ${results.closed}`);
  console.log(`- Errors: ${results.errors}`);

  // Close all connections
  for (const ws of connections) {
    ws.close();
  }
}

testConnectionClosure().catch(console.error);
@@ -1,156 +0,0 @@
package main

import (
    "fmt"
    "time"

    "next.orly.dev/app/config"
    "next.orly.dev/pkg/run"
)

// func TestDumbClientWorkaround(t *testing.T) {
//     var relay *run.Relay
//     var err error

//     // Start local relay for testing
//     if relay, _, err = startWorkaroundTestRelay(); err != nil {
//         t.Fatalf("Failed to start test relay: %v", err)
//     }
//     defer func() {
//         if stopErr := relay.Stop(); stopErr != nil {
//             t.Logf("Error stopping relay: %v", stopErr)
//         }
//     }()

//     relayURL := "ws://127.0.0.1:3338"

//     // Wait for relay to be ready
//     if err = waitForRelay(relayURL, 10*time.Second); err != nil {
//         t.Fatalf("Relay not ready after timeout: %v", err)
//     }

//     t.Logf("Relay is ready at %s", relayURL)

//     // Test connection with a "dumb" client that doesn't handle ping/pong properly
//     dialer := websocket.Dialer{
//         HandshakeTimeout: 10 * time.Second,
//     }

//     conn, _, err := dialer.Dial(relayURL, nil)
//     if err != nil {
//         t.Fatalf("Failed to connect: %v", err)
//     }
//     defer conn.Close()

//     t.Logf("Connection established")

//     // Simulate a dumb client that sets a short read deadline and doesn't handle ping/pong
//     conn.SetReadDeadline(time.Now().Add(30 * time.Second))

//     startTime := time.Now()
//     messageCount := 0

//     // The connection should stay alive despite the short client-side deadline
//     // because our workaround sets a 24-hour server-side deadline
//     connectionFailed := false
//     for time.Since(startTime) < 2*time.Minute && !connectionFailed {
//         // Extend client deadline every 10 seconds (simulating dumb client behavior)
//         if time.Since(startTime).Seconds() > 10 && int(time.Since(startTime).Seconds())%10 == 0 {
//             conn.SetReadDeadline(time.Now().Add(30 * time.Second))
//             t.Logf("Dumb client extended its own deadline")
//         }

//         // Try to read with a short timeout to avoid blocking
//         conn.SetReadDeadline(time.Now().Add(1 * time.Second))

//         // Use a function to catch panics from ReadMessage on failed connections
//         func() {
//             defer func() {
//                 if r := recover(); r != nil {
//                     if panicMsg, ok := r.(string); ok && panicMsg == "repeated read on failed websocket connection" {
//                         t.Logf("Connection failed, stopping read loop")
//                         connectionFailed = true
//                         return
//                     }
//                     // Re-panic if it's a different panic
//                     panic(r)
//                 }
//             }()

//             msgType, data, err := conn.ReadMessage()
//             conn.SetReadDeadline(time.Now().Add(30 * time.Second)) // Reset

//             if err != nil {
//                 if netErr, ok := err.(net.Error); ok && netErr.Timeout() {
//                     // Timeout is expected - just continue
//                     time.Sleep(100 * time.Millisecond)
//                     return
//                 }
//                 if websocket.IsCloseError(err, websocket.CloseNormalClosure, websocket.CloseGoingAway) {
//                     t.Logf("Connection closed normally: %v", err)
//                     connectionFailed = true
//                     return
//                 }
//                 t.Errorf("Unexpected error: %v", err)
//                 connectionFailed = true
//                 return
//             }

//             messageCount++
//             t.Logf("Received message %d: type=%d, len=%d", messageCount, msgType, len(data))
//         }()
//     }

//     elapsed := time.Since(startTime)
//     if elapsed < 90*time.Second {
//         t.Errorf("Connection died too early after %v (expected at least 90s)", elapsed)
//     } else {
//         t.Logf("Workaround successful: connection lasted %v with %d messages", elapsed, messageCount)
//     }
// }

// startWorkaroundTestRelay starts a relay for workaround testing
func startWorkaroundTestRelay() (relay *run.Relay, port int, err error) {
    cfg := &config.C{
        AppName: "ORLY-WORKAROUND-TEST",
        DataDir: "",
        Listen: "127.0.0.1",
        Port: 3338,
        HealthPort: 0,
        EnableShutdown: false,
        LogLevel: "info",
        DBLogLevel: "warn",
        DBBlockCacheMB: 512,
        DBIndexCacheMB: 256,
        LogToStdout: false,
        PprofHTTP: false,
        ACLMode: "none",
        AuthRequired: false,
        AuthToWrite: false,
        SubscriptionEnabled: false,
        MonthlyPriceSats: 6000,
        FollowListFrequency: time.Hour,
        WebDisableEmbedded: false,
        SprocketEnabled: false,
        SpiderMode: "none",
        PolicyEnabled: false,
    }

    // Set default data dir if not specified
    if cfg.DataDir == "" {
        cfg.DataDir = fmt.Sprintf("/tmp/orly-workaround-test-%d", time.Now().UnixNano())
    }

    // Create options
    cleanup := true
    opts := &run.Options{
        CleanupDataDir: &cleanup,
    }

    // Start relay
    if relay, err = run.Start(cfg, opts); err != nil {
        return nil, 0, fmt.Errorf("failed to start relay: %w", err)
    }

    return relay, cfg.Port, nil
}