
Commit 5c772b0
committed: stop excessive linking
1 parent: ec170c3

1 file changed: 12 additions & 16 deletions

src/content/blog/remora.mdx
@@ -140,10 +140,9 @@ that I didn't need super high accuracy metrics so the timing information embedded
 in traces was sufficient.
 
 For logging I kept it simple, just JSON or logfmt to stdout. For traces I used
-the opentelemetry libraries to send traces to
-[jaeger](https://www.jaegertracing.io/). This allowed me to have good visibility
-into the behavior and performance of the system no matter how distributed it
-became.
+the opentelemetry libraries to send traces to jaeger. This allowed me to have
+good visibility into the behavior and performance of the system no matter how
+distributed it became.
 
 
 # Future Improvements
@@ -182,9 +181,8 @@ file types, support scaleable filtering, and re-crawl old pages with a cron job.
 fast. When crawling Wikipedia, the size of the queue peeked at around 50
 million messages.
 
-One option here is to use [Kafka](https://kafka.apache.org/) which has its
-own downsides (routing will get more difficult) but it is designed to handle
-higher throughput.
+One option here is to use Kafka which has its own downsides (routing will
+get more difficult) but it is designed to handle higher throughput.
 
 The last option is to add messages to a simple database (probably postgres
 or sqlite3) and keep track of each end of the queue as you push and pop.
@@ -206,19 +204,17 @@ file types, support scaleable filtering, and re-crawl old pages with a cron job.
 After storing all of Wikipedia, a simple page rank query was on the order of
 minutes which is obviously not ideal.
 
-The best option for text search is to sprinkle in some
-[elasticsearch](https://www.elastic.co/elasticsearch) which is a popular
-solution for text search[^discord-elasticsearch]. Another option is to write
-a custom full-text search engine which is a fun project but a pretty big
-lift.
+The best option for text search is to sprinkle in some elasticsearch which
+is a popular solution for text search[^discord-elasticsearch]. Another
+option is to write a custom full-text search engine which is a fun project
+but a pretty big lift.
 
 I also made the decision to store the raw text and page info in postgres
 which makes the table huge and database scans ridiculously slow. There is
 really no reason to implement it this way and there are a few better options
 like storing page info in a database that is better at clustering like
-[Cassandra](https://cassandra.apache.org/) or
-[MongoDB](https://www.mongodb.com/) and storing raw text in an object store
-like [S3](https://aws.amazon.com/s3/) or [Minio](https://min.io/).
+Cassandra or MongoDB and storing raw text in an object store like S3 or
+Minio.
 
 4. Throw it in kubernetes.
 
@@ -280,5 +276,5 @@ on-demand but in the current unfinished state, its not super useful.
 [3]: <https://nlp.stanford.edu/IR-book/html/htmledition/dns-resolution-1.html> "IR DNS Resolver"
 [4]: <https://www.treehugger.com/remora-fish-suckers-sea-inspiring-new-adhesives-4858201> "Remora Fish, Those Suckers of the Sea, Are Inspiring New Adhesives"
 [5]: <https://dl.acm.org/doi/10.1145/1242572.1242592> "Detecting Near-Duplicates for Web Crawling"
-[6]: </remora/notes> "My Crawler Project Notes"
+[6]: </projects/remora/notes> "My Crawler Project Notes"
 
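The "simple database as a queue" option mentioned in the second hunk can be sketched minimally. This is an illustrative sketch, not code from the post: it assumes SQLite via Python's standard sqlite3 module, and the names (SqliteQueue, push, pop) are hypothetical. Monotonically increasing rowids serve as the tail of the queue, and popping the smallest remaining id serves as the head.

```python
import sqlite3

class SqliteQueue:
    """A minimal FIFO queue backed by a single SQLite table."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS queue ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, msg TEXT NOT NULL)"
        )

    def push(self, msg):
        # Tail of the queue: AUTOINCREMENT ids only grow.
        self.db.execute("INSERT INTO queue (msg) VALUES (?)", (msg,))
        self.db.commit()

    def pop(self):
        # Head of the queue: the smallest id still present.
        row = self.db.execute(
            "SELECT id, msg FROM queue ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        self.db.execute("DELETE FROM queue WHERE id = ?", (row[0],))
        self.db.commit()
        return row[1]

q = SqliteQueue()
q.push("https://en.wikipedia.org/wiki/Remora")
q.push("https://en.wikipedia.org/wiki/Web_crawler")
print(q.pop())  # -> https://en.wikipedia.org/wiki/Remora
```

Pointing the path at a file instead of ":memory:" makes the queue survive restarts, which is one advantage this approach has over an in-memory broker, at the cost of far lower throughput than Kafka.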