Commit acf073c: draft of remora blog post

1 parent b1b7ff7 commit acf073c

7 files changed: 298 additions & 0 deletions

src/content/blog/remora.mdx

Lines changed: 284 additions & 0 deletions

@@ -0,0 +1,284 @@
---
title: Remora
description: A distributed web crawler.
tags:
  - web
  - tech
draft: true
pubDate: 2025-04-29T00:29:53.013Z
slug: remora
---

import { Image } from "astro:assets";
import RemoraSharkWebp from "~/img/remora/remora-and-shark.webp";
<div
  style="display:block;margin-left:auto;margin-right:auto;width:80%;"
>
  <a
    href="https://www.treehugger.com/remora-fish-suckers-sea-inspiring-new-adhesives-4858201"
    target="_blank"
  >
    <Image
      src={RemoraSharkWebp}
      alt="Remora fish under a shark"
      width="600"
      height="400"
    />
  </a>
  <a
    href="https://www.treehugger.com/remora-fish-suckers-sea-inspiring-new-adhesives-4858201"
    title="Remora Fish, Those Suckers of the Sea, Are Inspiring New Adhesives"
  >
    <em style="font-size:11px">photo credit</em>
  </a>
</div>

A remora is a type of fish that lives in symbiosis with a larger shark or
whale: the remora removes parasites and cleans off dead skin, and in return it
benefits from the protection of the larger fish. A web crawler crawls the
internet, indexing sites and pruning dead pages. Much like the remora, a web
crawler lives alongside the internet as a mutually beneficial organism.

> See the [project page](/projects/remora).

# Inspiration

* **Data collection**. The data scientist in me wants to hoover up all the data.
* **Bookmark search**. I hate my browser's bookmarks and I want a searchable
  bookmark tool.
* **General usefulness**. Having a sophisticated custom web crawler seems
  useful for many reasons.
* **System design**. This is the dumbest reason, but system design interviews
  seem hard and I wanted hands-on experience to inform my studying.

# Design Requirements/Considerations

- Must be able to crawl a site indefinitely.
- Politeness
  - don't crawl too fast
  - respect `robots.txt` (your IP **will** get blocked)
- Handle and store graphs with a relatively high degree.
- Will be heavily reliant on routable queuing.
- Custom DNS resolver[^dns]

### MVP (short term goals)

My short-term goals for this project were fairly limited, mostly because of
time but also because there are a ton of possible digressions that distract
from implementing the core functionality.

1. Halfway-usable search.
2. Page rank.

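For reference, page rank can be sketched in a few lines of power iteration over an adjacency list. This is the textbook algorithm, not the project's implementation; the function name and damping factor of 0.85 are the conventional choices.

```go
package main

import "fmt"

// PageRank runs power iteration over a link graph given as adjacency
// lists (links[i] holds the pages that page i links to). Dangling pages
// distribute their rank uniformly so the total always sums to 1.
func PageRank(links [][]int, d float64, iters int) []float64 {
	n := len(links)
	rank := make([]float64, n)
	for i := range rank {
		rank[i] = 1.0 / float64(n)
	}
	for it := 0; it < iters; it++ {
		next := make([]float64, n)
		base := (1 - d) / float64(n)
		for i := range next {
			next[i] = base // teleportation term
		}
		for i, outs := range links {
			if len(outs) == 0 { // dangling page: spread rank everywhere
				for j := range next {
					next[j] += d * rank[i] / float64(n)
				}
				continue
			}
			share := d * rank[i] / float64(len(outs))
			for _, j := range outs {
				next[j] += share
			}
		}
		rank = next
	}
	return rank
}

func main() {
	// Tiny graph: 0 -> 1, 1 -> {0, 2}, 2 -> 0.
	fmt.Println(PageRank([][]int{{1}, {0, 2}, {0}}, 0.85, 50))
}
```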
# Design

![system design diagram](~/img/remora/basic-architecture.svg)

The core components of this design are the crawlers and the frontier queue.
Here, the "Crawler API" is a fairly abstract representation of the breadth of
possible architectures that the crawler flow can have.

### Crawler Flow

![crawler workflow diagram](~/img/remora/crawler-flow.svg)

In the simplest case, a single thread handles every stage; in the most complex
case, each stage can be a separate microservice.

### Frontier Queue

![frontier queue design diagram](~/img/remora/frontier.svg)

This is a simplified version of the [IR book's][1] frontier queue diagram. The
priority and routing stages are simplified here because most of that work is
handled by my decision to use RabbitMQ, which takes care of routing and
prioritization.

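One detail worth showing from the IR book's frontier design: URLs are routed into per-host "back queues" so that politeness can be enforced per queue. A minimal sketch of that routing step, assuming a fixed number of back queues (the `backQueue` name is mine, not the project's):

```go
package main

import (
	"fmt"
	"hash/fnv"
	"net/url"
)

// backQueue maps a URL to one of n back queues by hashing its host, so
// every URL for a given host lands on the same queue and a per-queue
// delay gives per-host politeness for free.
func backQueue(rawURL string, n uint32) (uint32, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return 0, err
	}
	h := fnv.New32a()
	h.Write([]byte(u.Hostname()))
	return h.Sum32() % n, nil
}

func main() {
	a, _ := backQueue("https://en.wikipedia.org/wiki/Remora", 16)
	b, _ := backQueue("https://en.wikipedia.org/wiki/Shark", 16)
	fmt.Println(a == b) // same host, same queue: true
}
```

With RabbitMQ, the queue index can simply become the routing key.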
# Extensibility

In my inspiration section, I said I wanted a web crawler for "general
usefulness," which is a pretty open-ended requirement. This necessitates a
design that can easily be collapsed or expanded, so I designed a `Visitor` Go
interface for objects that would eventually have access to a crawled URL.

```go
type Visitor interface {
	// Filter is called after checking page depth
	// and after checking for a repeated URL.
	Filter(*PageRequest, *url.URL) error
	// Visit is called after a page is fetched.
	Visit(context.Context, *Page)
}
```

Using this visitor pattern allowed me to abstract the pipelined operations
that needed to be applied to each URL pulled from the queue.

This allows a useful degree of composability and makes it easy to split
sections out into separate microservices.

Consider this example of chaining multiple visitors:

```go
type Visitors []Visitor

// Filter runs each visitor's Filter in order, stopping at the first error.
func (vs Visitors) Filter(p *PageRequest, u *url.URL) error {
	for _, v := range vs {
		if err := v.Filter(p, u); err != nil {
			return err
		}
	}
	return nil
}

// Visit fans the fetched page out to every visitor in the chain.
func (vs Visitors) Visit(ctx context.Context, p *Page) {
	for _, v := range vs {
		v.Visit(ctx, p)
	}
}
```

# Observability

### Logs, Metrics, Traces

The big three of observability are necessary for any sufficiently large
project, and this one is no different. However, one assumption I made was
that I didn't need highly accurate metrics, so the timing information
embedded in traces was sufficient.

For logging I kept it simple: just JSON or logfmt to stdout. For traces I
used the OpenTelemetry libraries to send traces to
[Jaeger](https://www.jaegertracing.io/). This allowed me to have good
visibility into the behavior and performance of the system no matter how
distributed it became.

# Future Improvements

Unfortunately, none of my projects are ever finished, and this one is
probably among the most unfinished, simply due to the number of improvements
and configuration options I want to add.

### Future Architecture Improvements

As previously mentioned, using a visitor-like pattern makes it pretty easy to
extend the architecture by adding visitors that make calls out to other
microservices.

Here's one option for how the architecture could be expanded to handle
different file types, support scalable filtering, and re-crawl old pages with
a cron job.

![diagram for future system design plans](~/img/remora/future-architecture.svg)

### Future Tech Improvements

1. Implement near-duplicate detection.

   This is a pretty simple, self-explanatory feature that would reduce the
   number of nodes to visit, improving both the storage and time efficiency
   of the crawler. Google published [a paper][5] documenting near-duplicate
   detection for web crawlers that I would probably follow closely.

2. Swap out the queue.

   I decided to use RabbitMQ, which was for the most part a good decision.
   However, because of the nature of the web, each node in the graph has a
   high degree, which causes each queue to build up really fast. For
   example, the average Wikipedia page has about 100 links, and RabbitMQ
   only persists queues to disk periodically when they start to get too
   big. This means that when crawling large sites, the memory usage of the
   queue explodes quite fast: when crawling Wikipedia, the queue peaked at
   around 50 million messages.

   One option is to use [Kafka](https://kafka.apache.org/), which has its
   own downsides (routing gets more difficult) but is designed to handle
   higher throughput.

   The other option is to push messages into a simple database (probably
   Postgres or sqlite3) and keep track of each end of the queue as you push
   and pop. This would allow **all** messages to be persisted to disk,
   preventing the ballooning memory usage. This option, like Kafka, would
   require more bespoke solutions to emulate RabbitMQ's routing features.
   The downside is that reading everything from disk is pretty slow,
   although you can amortize that cost by reading blocks of entries from
   the front of the queue.

198+
3. Use better database(s).
199+
200+
I chose to use Postgres to store page information which ended up being my
201+
biggest performance bottleneck even when the crawl speed is throttled.
202+
203+
Postgres is great as a general use database but I found myself misusing it
204+
in a couple ways. First, the built-in full-text search feature is awesome
205+
for throwing together a MVP but it quickly slowed down on large sites.
206+
After storing all of Wikipedia, a simple page rank query was on the order of
207+
minutes which is obviously not ideal.
208+
209+
The best option for text search is to sprinkle in some
210+
[elasticsearch](https://www.elastic.co/elasticsearch) which is a popular
211+
solution for text search[^discord-elasticsearch]. Another option is to write
212+
a custom full-text search engine which is a fun project but a pretty big
213+
lift.
214+
215+
I also made the decision to store the raw text and page info in postgres
216+
which makes the table huge and database scans ridiculously slow. There is
217+
really no reason to implement it this way and there are a few better options
218+
like storing page info in a database that is better at clustering like
219+
[Cassandra](https://cassandra.apache.org/) or
220+
[MongoDB](https://www.mongodb.com/) and storing raw text in an object store
221+
like [S3](https://aws.amazon.com/s3/) or [Minio](https://min.io/).
222+
4. Throw it in Kubernetes.

   I wasted a ton of time implementing my own docker-compose alternative
   that is better at deploying many containers with slightly different
   configurations. Having learned Kubernetes after doing the bulk of the
   work on this project, I would improve it by using Kubernetes primitives
   to deploy the various parts of the system and to handle replication of
   the core pieces. I can also see value in implementing a custom
   Kubernetes operator for managing the queue and the replicated crawlers,
   since the lines between configuration and infrastructure in this project
   are fairly blurred.

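The near-duplicate idea from improvement 1 is usually built on simhash fingerprints, where similar documents get fingerprints with a small Hamming distance. A sketch of the fingerprinting core (this skips the table-based lookup scheme the paper describes for finding near matches at scale, and the function names are mine):

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math/bits"
	"strings"
)

// Simhash computes a 64-bit fingerprint of a token stream. Each token's
// hash votes bit-by-bit: set bits vote +1, clear bits vote -1, and the
// fingerprint keeps the winning sign of each position.
func Simhash(tokens []string) uint64 {
	var v [64]int
	for _, t := range tokens {
		h := fnv.New64a()
		h.Write([]byte(t))
		sum := h.Sum64()
		for i := 0; i < 64; i++ {
			if sum&(1<<uint(i)) != 0 {
				v[i]++
			} else {
				v[i]--
			}
		}
	}
	var out uint64
	for i := 0; i < 64; i++ {
		if v[i] > 0 {
			out |= 1 << uint(i)
		}
	}
	return out
}

// HammingDistance counts differing bits between two fingerprints; small
// distances suggest near-duplicate documents.
func HammingDistance(a, b uint64) int { return bits.OnesCount64(a ^ b) }

func main() {
	a := Simhash(strings.Fields("the quick brown fox jumps over the lazy dog"))
	b := Simhash(strings.Fields("the quick brown fox jumped over the lazy dog"))
	c := Simhash(strings.Fields("completely unrelated text about databases"))
	fmt.Println("near:", HammingDistance(a, b), "far:", HammingDistance(a, c))
}
```

In a crawler, a page whose fingerprint is within a few bits of one already stored would be treated as a duplicate and skipped.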
# Notable Digressions

### Custom Search

Custom Tf-idf[^tf-idf] vectorized search tooling. This is a pretty big
project and I have not finished it yet. To see my (pretty slow) progress, see
my github repo [harrybrwn/ts](https://github.com/harrybrwn/ts).

Pros:
* Super fine-grained control over ranking.
* Performance issues are my own fault (I can make assumptions about the
  structure of the search space).

Cons:
* It's a lot of work, and elasticsearch is probably better anyway.

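For a sense of how small the core of tf-idf is, here is the basic scheme from the IR book for a single query term: term frequency weighted by the inverse document frequency, with no normalization or smoothing. A toy sketch, not the tooling in the repo above:

```go
package main

import (
	"fmt"
	"math"
	"strings"
)

// TfIdf scores every document for one query term: tf (the term's count
// in the document) times idf = log(N / df), where df is the number of
// documents containing the term.
func TfIdf(term string, docs [][]string) []float64 {
	df := 0
	for _, d := range docs {
		for _, w := range d {
			if w == term {
				df++
				break // count each document at most once
			}
		}
	}
	scores := make([]float64, len(docs))
	if df == 0 {
		return scores // term appears nowhere; all scores stay 0
	}
	idf := math.Log(float64(len(docs)) / float64(df))
	for i, d := range docs {
		tf := 0
		for _, w := range d {
			if w == term {
				tf++
			}
		}
		scores[i] = float64(tf) * idf
	}
	return scores
}

func main() {
	docs := [][]string{
		strings.Fields("remora fish shark remora"),
		strings.Fields("web crawler indexes the web"),
		strings.Fields("shark and fish"),
	}
	fmt.Println(TfIdf("remora", docs))
}
```

A real engine builds an inverted index so these counts are precomputed per term rather than scanned per query.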
### Orchestration

I built a custom container orchestration tool for this project. It was
overkill, but docker-compose didn't work very well and I hadn't learned
Kubernetes yet.

I've also gotten caught up in creating configuration APIs for adding or
removing site crawlers in new threads. This is something that could be
important in the future, when I have the crawler sitting around and want to
crawl a site on-demand, but in the project's current unfinished state it's
not super useful.

# Sources

1. The [Information Retrieval][1] textbook.
2. Heading image from [treehugger.com][4].
3. I have a list of other resources that might be useful on my [remora notes page][6].

[^IR]: https://nlp.stanford.edu/IR-book/
[^tf-idf]: Tf-idf is a document scoring algorithm described in chapter 6 of the [IR][1] book.
[^dns]: DNS resolution is a common bottleneck for web crawlers for a variety of
    reasons described in [IR section 20.2.2][3]. I implemented a custom resolver
    for this project but did not see any measurable performance improvements, so
    I took it out.
[^discord-elasticsearch]: Elasticsearch is used by Discord to index trillions of
    messages. They describe this in an [engineering blog post](https://discord.com/blog/how-discord-indexes-trillions-of-messages).

[1]: <https://nlp.stanford.edu/IR-book/> "Introduction to Information Retrieval"
[2]: <https://nlp.stanford.edu/IR-book/html/htmledition/tf-idf-weighting-1.html> "IR Tf-idf weighting"
[3]: <https://nlp.stanford.edu/IR-book/html/htmledition/dns-resolution-1.html> "IR DNS Resolver"
[4]: <https://www.treehugger.com/remora-fish-suckers-sea-inspiring-new-adhesives-4858201> "Remora Fish, Those Suckers of the Sea, Are Inspiring New Adhesives"
[5]: <https://dl.acm.org/doi/10.1145/1242572.1242592> "Detecting Near-Duplicates for Web Crawling"
[6]: </remora/notes> "My Crawler Project Notes"

src/content/config.ts

Lines changed: 1 addition & 0 deletions

@@ -16,6 +16,7 @@ const blog = defineCollection({
     pubDate: z.date(),
     modDate: z.date().optional(),
     blog: z.boolean().optional(),
+    image: z.string().optional(),
     draft: z.boolean().optional(),
   }),
 });

src/img/remora/basic-architecture.svg

Lines changed: 3 additions & 0 deletions

src/img/remora/crawler-flow.svg

Lines changed: 3 additions & 0 deletions

src/img/remora/frontier.svg

Lines changed: 3 additions & 0 deletions

src/img/remora/future-architecture.svg

Lines changed: 3 additions & 0 deletions

src/pages/projects/remora/notes.md

Lines changed: 1 addition & 0 deletions

@@ -13,6 +13,7 @@ layout: ~/layouts/Basic.astro
 - [Overview of various webcrawler architectures](https://www.microsoft.com/en-us/research/wp-content/uploads/2009/09/EDS-WebCrawlerArchitecture.pdf)
 - [Estimating page freshness](https://www.youtube.com/watch?v=qrBXI_hrWrI)
 - [Compaq Research Paper](https://www.cs.cornell.edu/courses/cs685/2002fa/mercator.pdf) with a [video presentation](https://www.youtube.com/watch?v=i5qLt0ShJSg)
+- [Web Crawler Architecture](https://marc.najork.org/papers/eds2009a.pdf), Marc Najork


 # Queues
