---
title: Remora
description: A distributed web crawler.
tags:
- web
- tech
draft: true
pubDate: 2025-04-29T00:29:53.013Z
slug: remora
---

import { Image } from "astro:assets";
import RemoraSharkWebp from "~/img/remora/remora-and-shark.webp";

<div
  style="display:block;margin-left:auto;margin-right:auto;width:80%;"
>
  <a
    href="https://www.treehugger.com/remora-fish-suckers-sea-inspiring-new-adhesives-4858201"
    target="_blank"
  >
    <Image
      src={RemoraSharkWebp}
      alt="Remora Fish under a shark"
      width="600"
      height="400"
    />
  </a>
  <a
    href="https://www.treehugger.com/remora-fish-suckers-sea-inspiring-new-adhesives-4858201"
    title="Remora Fish, Those Suckers of the Sea, Are Inspiring New Adhesives"
  >
    <em style="font-size:11px">photo credit</em>
  </a>
</div>

A remora is a fish that lives in symbiosis with a larger animal such as a whale
or shark: it removes parasites and dead skin from its host, and in return it
benefits from the larger animal's protection. A web crawler crawls the
internet, indexing sites and pruning dead pages. Much like the remora, a web
crawler lives alongside the internet as a mutually beneficial organism.

> See the [project page](/projects/remora).

# Inspiration

* **Data collection**. The data scientist in me wants to hoover up all the data.
* **Bookmarks search**. I hate my browser's bookmarks and I want a searchable
  bookmark tool.
* **General usefulness**. Having a sophisticated custom web crawler seems useful
  for many reasons.
* **System design**. This is the dumbest reason, but system design interviews seem
  hard and I wanted hands-on experience to inform my studying.

# Design Requirements/Considerations

- Must be able to crawl a site indefinitely.
- Politeness
  - don't crawl too fast
  - respect `robots.txt` (your IP **will** get blocked)
- Handle and store graphs with a relatively high degree.
- Will rely heavily on routable queuing.
- Custom DNS resolver[^dns]

### MVP (short term goals)

My short-term goals for this project were fairly limited, mostly because of
time but also because there are a ton of possible digressions that distract
from implementing the core functionality.

1. Halfway usable search.
2. Page rank.

# Design

The core components of this design are the crawlers and the frontier queue.
Here the "Crawler API" is a fairly abstract representation of the breadth of
possible architectures that the crawler flow can have.

### Crawler Flow

In the simplest case, a single thread handles each stage; in the most complex
case, it can be a series of microservices.

### Frontier Queue

This is a simplified version of the [IR book's][1] frontier queue diagram.
Here, prioritization and routing are simplified because most of that work is
covered by my decision to use RabbitMQ, which handles routing and
prioritization for me.

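To illustrate how RabbitMQ can absorb the routing work, here is a sketch of
deriving a per-host routing key for a topic exchange. The `crawl.` prefix and
the one-queue-per-host scheme are assumptions for the example, not Remora's
real configuration:

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// routingKey derives a RabbitMQ-style routing key from a page URL so
// that all pages from one host land on the same queue.
func routingKey(raw string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	host := strings.TrimPrefix(u.Hostname(), "www.")
	// RabbitMQ topic exchanges treat "." as a word separator,
	// so flatten the dots in the hostname.
	return "crawl." + strings.ReplaceAll(host, ".", "-"), nil
}

func main() {
	key, _ := routingKey("https://www.example.com/some/page")
	fmt.Println(key) // crawl.example-com
}
```

Binding one consumer queue per routing key then gives per-host fairness and
politeness almost for free.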
# Extensibility

In my inspiration section, I said I wanted a web crawler for "general
usefulness," which is a pretty open-ended requirement. This calls for a design
that can be easily collapsed or expanded, so I designed a `Visitor` Go
interface for objects that would eventually have access to a crawled URL.

```go
type Visitor interface {
    // Filter is called after checking page depth
    // and after checking for a repeated URL.
    Filter(*PageRequest, *url.URL) error
    // Visit is called after a page is fetched.
    Visit(context.Context, *Page)
}
```

Using this visitor pattern allowed me to abstract the pipelined operations that
needed to be applied to each URL pulled from the queue.

This allows a useful degree of composability and makes it easy to split
sections up into separate microservices.

Consider this example of chaining multiple visitors.

```go
type Visitors []Visitor

func (vs Visitors) Filter(p *PageRequest, u *url.URL) error {
    for _, v := range vs {
        if err := v.Filter(p, u); err != nil {
            return err
        }
    }
    return nil
}

func (vs Visitors) Visit(ctx context.Context, p *Page) {
    for _, v := range vs {
        v.Visit(ctx, p)
    }
}
```

# Observability

### Logs, Metrics, Traces

The big three of observability are necessary for any sufficiently large
project, and this project is no different. However, one assumption I made was
that I didn't need highly accurate metrics, so the timing information embedded
in traces was sufficient.

For logging I kept it simple: just JSON or logfmt to stdout. For traces I used
the OpenTelemetry libraries to send traces to
[Jaeger](https://www.jaegertracing.io/). This gave me good visibility into the
behavior and performance of the system no matter how distributed it became.

# Future Improvements

None of my projects are ever finished, unfortunately, and this one is probably
the most unfinished, simply due to the number of improvements and configuration
options I want to add.

### Future Architecture Improvements

As previously mentioned, using a visitor-like pattern makes it pretty easy to
extend the architecture by adding visitors that make calls out to other
microservices.

Here's one option for how the architecture could be expanded to handle
different file types, support scalable filtering, and re-crawl old pages with a
cron job.

### Future Tech Improvements

1. Implement near-duplicate detection.

   This is a pretty simple, self-explanatory feature that would reduce the
   number of nodes to visit, improving both the storage and time efficiency
   of the crawler. Google published [a paper][5] documenting near-duplicate
   detection for web crawlers that I would probably follow closely.

2. Swap out the queue.

   I decided to use RabbitMQ, which was for the most part a good decision.
   However, because of the nature of the web, each node in the graph has a
   high degree, which causes each queue to build up really fast. For example,
   the average Wikipedia page has about 100 links. RabbitMQ only persists
   queues to disk periodically, once they start to get too big. This means
   that if you are crawling large sites, the memory usage of the queue
   explodes quite fast. When crawling Wikipedia, the size of the queue peaked
   at around 50 million messages.

   One option here is to use [Kafka](https://kafka.apache.org/), which has its
   own downsides (routing will get more difficult) but is designed to handle
   higher throughput.

   The last option is to add messages to a simple database (probably Postgres
   or SQLite) and keep track of each end of the queue as you push and pop.
   This would allow **all** messages to be persisted to disk to prevent
   ballooning memory usage. This option, like Kafka, will require more bespoke
   solutions to emulate the routing features of RabbitMQ. The downside is that
   reading everything from disk is going to be pretty slow, although you can
   amortize this performance hit by reading blocks of entries from the front
   of the queue.

3. Use better database(s).

   I chose Postgres to store page information, which ended up being my biggest
   performance bottleneck even when the crawl speed was throttled.

   Postgres is great as a general-use database, but I found myself misusing it
   in a couple of ways. First, the built-in full-text search feature is
   awesome for throwing together an MVP, but it quickly slowed down on large
   sites. After storing all of Wikipedia, a simple page rank query took on the
   order of minutes, which is obviously not ideal.

   The best option for text search is to sprinkle in some
   [Elasticsearch](https://www.elastic.co/elasticsearch), which is a popular
   solution for text search[^discord-elasticsearch]. Another option is to
   write a custom full-text search engine, which is a fun project but a pretty
   big lift.

   I also made the decision to store the raw text and page info in Postgres,
   which makes the table huge and database scans ridiculously slow. There is
   really no reason to implement it this way, and there are a few better
   options, like storing page info in a database that is better at
   clustering, such as [Cassandra](https://cassandra.apache.org/) or
   [MongoDB](https://www.mongodb.com/), and storing raw text in an object
   store like [S3](https://aws.amazon.com/s3/) or [MinIO](https://min.io/).

4. Throw it in Kubernetes.

   I wasted a ton of time implementing my own docker-compose alternative that
   is better at deploying many containers with slightly different
   configurations. Having learned Kubernetes after doing the bulk of the work
   on this project, I would improve it by using Kubernetes primitives to
   deploy the various parts of the system and to handle replication of the
   core pieces. I can also see value in implementing a custom Kubernetes
   operator for managing the queue and replicated crawlers, since the lines
   between configuration and infrastructure in this project are fairly
   blurred.

# Notable Digressions

### Custom Search

Custom tf-idf[^tf-idf] vectorized search tooling. This is a pretty big project
and I have not finished it yet. To see my (pretty slow) progress, check out my
GitHub repo [harrybrwn/ts](https://github.com/harrybrwn/ts).

Pros:
* Super fine-grained control over ranking.
* Performance issues are my fault (I can make assumptions about the structure
  of the search space).

Cons:
* It's a lot of work, and Elasticsearch is probably better anyway.

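As a rough sketch of what tf-idf scoring involves (not necessarily how
harrybrwn/ts implements it):

```go
package main

import (
	"fmt"
	"math"
	"strings"
)

// tfidf scores every term in every document of a small corpus:
// term frequency times the log of inverse document frequency.
// A term that appears in every document scores zero everywhere,
// which is what pushes stop words to the bottom of rankings.
func tfidf(docs []string) []map[string]float64 {
	df := map[string]int{} // how many docs contain each term
	toks := make([][]string, len(docs))
	for i, d := range docs {
		toks[i] = strings.Fields(strings.ToLower(d))
		seen := map[string]bool{}
		for _, t := range toks[i] {
			if !seen[t] {
				seen[t] = true
				df[t]++
			}
		}
	}
	n := float64(len(docs))
	scores := make([]map[string]float64, len(docs))
	for i, ts := range toks {
		tf := map[string]float64{}
		for _, t := range ts {
			tf[t]++
		}
		scores[i] = make(map[string]float64)
		for t, f := range tf {
			scores[i][t] = (f / float64(len(ts))) * math.Log(n/float64(df[t]))
		}
	}
	return scores
}

func main() {
	s := tfidf([]string{
		"remora shark symbiosis",
		"remora crawler queue",
	})
	fmt.Println(s[0]["remora"] == 0) // in every doc, so zero weight
	fmt.Println(s[0]["shark"] > 0)   // distinctive term, positive weight
}
```

The "vectorized" part of the real project would treat each document's score
map as a sparse vector and rank by cosine similarity against a query vector.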
### Orchestration

I built a custom container orchestration tool for this project. It was
overkill, but docker-compose didn't work very well and I hadn't learned
Kubernetes yet.

I've also gotten caught up in creating configuration APIs for adding or
removing site crawlers in new threads. This is something that could be
important in the future when I have the crawler sitting around and want to
crawl a site on demand, but in its current unfinished state it's not super
useful.

# Sources

1. The [Information Retrieval][1] textbook.
2. Heading image from [treehugger.com][4].
3. I have a list of other resources that might be useful on my [remora notes page][6].

[^tf-idf]: Tf-idf is a document scoring algorithm covered in [IR][1] chapter 6.
[^dns]: DNS resolution is a common bottleneck for web crawlers for a variety of
    reasons described in [IR section 20.2.2][3]. I implemented a custom resolver
    for this project but did not see any measurable performance improvement, so
    I took it out.
[^discord-elasticsearch]: Elasticsearch is used by Discord to index trillions of
    messages. They describe this in an [engineering blog post](https://discord.com/blog/how-discord-indexes-trillions-of-messages).

[1]: <https://nlp.stanford.edu/IR-book/> "Introduction to Information Retrieval"
[2]: <https://nlp.stanford.edu/IR-book/html/htmledition/tf-idf-weighting-1.html> "IR Tf-idf weighting"
[3]: <https://nlp.stanford.edu/IR-book/html/htmledition/dns-resolution-1.html> "IR DNS Resolver"
[4]: <https://www.treehugger.com/remora-fish-suckers-sea-inspiring-new-adhesives-4858201> "Remora Fish, Those Suckers of the Sea, Are Inspiring New Adhesives"
[5]: <https://dl.acm.org/doi/10.1145/1242572.1242592> "Detecting Near-Duplicates for Web Crawling"
[6]: </remora/notes> "My Crawler Project Notes"