**`src/content/blog/remora.mdx`** (12 additions, 16 deletions)
```diff
@@ -140,10 +140,9 @@ that I didn't need super high accuracy metrics so the timing information embedded
 in traces was sufficient.
 
 For logging I kept it simple, just JSON or logfmt to stdout. For traces I used
-the opentelemetry libraries to send traces to
-[jaeger](https://www.jaegertracing.io/). This allowed me to have good visibility
-into the behavior and performance of the system no matter how distributed it
-became.
+the opentelemetry libraries to send traces to jaeger. This allowed me to have
+good visibility into the behavior and performance of the system no matter how
+distributed it became.
 
 
 # Future Improvements
```
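For context on the paragraph this hunk rewraps, the opentelemetry-to-jaeger wiring is only a few lines. A minimal sketch, assuming a Go service and the opentelemetry-go jaeger exporter; the post doesn't show its actual setup, and the service and span names here are invented:

```go
package main

import (
	"context"
	"log"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/jaeger"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func main() {
	// Export spans to a local jaeger collector; the endpoint here is the
	// collector's default HTTP port and would be configurable in practice.
	exp, err := jaeger.New(jaeger.WithCollectorEndpoint(
		jaeger.WithEndpoint("http://localhost:14268/api/traces"),
	))
	if err != nil {
		log.Fatal(err)
	}

	tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exp))
	defer tp.Shutdown(context.Background())
	otel.SetTracerProvider(tp)

	// Wrap a unit of work in a span; the timing embedded in the span is the
	// coarse latency metric the post relies on instead of real metrics.
	ctx, span := otel.Tracer("remora").Start(context.Background(), "fetch-page")
	defer span.End()
	_ = ctx // pass ctx down so child spans nest under this one
}
```

Child spans started from the returned context nest under `fetch-page`, which is what keeps the trace readable as the system becomes more distributed.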
```diff
@@ -182,9 +181,8 @@ file types, support scalable filtering, and re-crawl old pages with a cron job.
 fast. When crawling Wikipedia, the size of the queue peaked at around 50
 million messages.
 
-One option here is to use [Kafka](https://kafka.apache.org/) which has its
-own downsides (routing will get more difficult) but it is designed to handle
-higher throughput.
+One option here is to use Kafka which has its own downsides (routing will
+get more difficult) but it is designed to handle higher throughput.
 
 The last option is to add messages to a simple database (probably postgres
 or sqlite3) and keep track of each end of the queue as you push and pop.
```
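To make the routing concern with Kafka concrete: partition assignment takes over the role of routing, e.g. keying messages by host so a single partition (and thus a single consumer) owns each domain. A sketch using the segmentio/kafka-go client; the topic name and broker address are made up:

```go
package main

import (
	"context"
	"log"

	kafka "github.com/segmentio/kafka-go"
)

func main() {
	// Hash-balance on the message key so all URLs for one host land on the
	// same partition; per-host routing then falls out of partition ordering.
	w := &kafka.Writer{
		Addr:     kafka.TCP("localhost:9092"),
		Topic:    "frontier",
		Balancer: &kafka.Hash{},
	}
	defer w.Close()

	err := w.WriteMessages(context.Background(), kafka.Message{
		Key:   []byte("en.wikipedia.org"),
		Value: []byte("https://en.wikipedia.org/wiki/Remora"),
	})
	if err != nil {
		log.Fatal(err)
	}
}
```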
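And a sketch of the database-backed option: a single table where pushes append to the tail and pops delete from the head, with `FOR UPDATE SKIP LOCKED` so concurrent workers never hand out the same URL twice. The table and function names are hypothetical, not from the post:

```go
package queue

import (
	"context"
	"database/sql"
)

// Assumed schema:
//   CREATE TABLE frontier (id BIGSERIAL PRIMARY KEY, url TEXT NOT NULL);

// Push appends a URL to the tail of the queue.
func Push(ctx context.Context, db *sql.DB, url string) error {
	_, err := db.ExecContext(ctx, `INSERT INTO frontier (url) VALUES ($1)`, url)
	return err
}

// Pop removes and returns the URL at the head of the queue.
// SKIP LOCKED lets concurrent workers pop without blocking each other;
// sql.ErrNoRows signals an empty queue.
func Pop(ctx context.Context, db *sql.DB) (string, error) {
	var url string
	err := db.QueryRowContext(ctx, `
		DELETE FROM frontier
		WHERE id = (
			SELECT id FROM frontier
			ORDER BY id
			FOR UPDATE SKIP LOCKED
			LIMIT 1
		)
		RETURNING url`).Scan(&url)
	return url, err
}
```

Both ends of the queue only touch the primary-key index, which is what keeps this workable even at tens of millions of queued messages.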
```diff
@@ -206,19 +204,17 @@ file types, support scalable filtering, and re-crawl old pages with a cron job.
 After storing all of Wikipedia, a simple page rank query was on the order of
 minutes which is obviously not ideal.
 
-The best option for text search is to sprinkle in some
-[elasticsearch](https://www.elastic.co/elasticsearch) which is a popular
-solution for text search[^discord-elasticsearch]. Another option is to write
-a custom full-text search engine which is a fun project but a pretty big
-lift.
+The best option for text search is to sprinkle in some elasticsearch which
+is a popular solution for text search[^discord-elasticsearch]. Another
+option is to write a custom full-text search engine which is a fun project
+but a pretty big lift.
 
 I also made the decision to store the raw text and page info in postgres
 which makes the table huge and database scans ridiculously slow. There is
 really no reason to implement it this way and there are a few better options
 like storing page info in a database that is better at clustering like
-[Cassandra](https://cassandra.apache.org/) or
-[MongoDB](https://www.mongodb.com/) and storing raw text in an object store
-like [S3](https://aws.amazon.com/s3/) or [Minio](https://min.io/).
+Cassandra or MongoDB and storing raw text in an object store like S3 or
+Minio.
 
 4. Throw it in kubernetes.
 
```
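As a sketch of what "sprinkling in some elasticsearch" could look like here: index each crawled page, then run a `match` query over the body field. This assumes the official go-elasticsearch client and an invented `pages` index; it is not code from the post:

```go
package search

import (
	"context"
	"encoding/json"
	"strings"

	elasticsearch "github.com/elastic/go-elasticsearch/v8"
)

// IndexPage stores one crawled page; "pages" is a hypothetical index name.
func IndexPage(ctx context.Context, es *elasticsearch.Client, id, url, body string) error {
	doc, err := json.Marshal(map[string]string{"url": url, "body": body})
	if err != nil {
		return err
	}
	res, err := es.Index("pages", strings.NewReader(string(doc)),
		es.Index.WithDocumentID(id),
		es.Index.WithContext(ctx),
	)
	if err != nil {
		return err
	}
	return res.Body.Close()
}

// Search runs a full-text match query over page bodies.
func Search(ctx context.Context, es *elasticsearch.Client, terms string) error {
	q, err := json.Marshal(map[string]any{
		"query": map[string]any{"match": map[string]any{"body": terms}},
	})
	if err != nil {
		return err
	}
	res, err := es.Search(
		es.Search.WithIndex("pages"),
		es.Search.WithBody(strings.NewReader(string(q))),
		es.Search.WithContext(ctx),
	)
	if err != nil {
		return err
	}
	defer res.Body.Close()
	// Decode hits from res.Body as needed.
	return nil
}
```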
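And a sketch of the storage split the second paragraph proposes: page metadata stays in the clustered database while the raw text lands in an object store. This uses the minio-go client (which speaks the S3 API, so it works against either Minio or S3); the endpoint, bucket, and key scheme are invented:

```go
package store

import (
	"context"
	"strings"

	"github.com/minio/minio-go/v7"
	"github.com/minio/minio-go/v7/pkg/credentials"
)

// PutRawText writes a page body to an object store keyed by page ID;
// "raw-pages" and the ".txt" key scheme are invented for this sketch.
func PutRawText(ctx context.Context, pageID, body string) error {
	client, err := minio.New("localhost:9000", &minio.Options{
		Creds: credentials.NewStaticV4("ACCESS_KEY", "SECRET_KEY", ""),
	})
	if err != nil {
		return err
	}
	_, err = client.PutObject(ctx, "raw-pages", pageID+".txt",
		strings.NewReader(body), int64(len(body)),
		minio.PutObjectOptions{ContentType: "text/plain"},
	)
	return err
}
```

The postgres row then shrinks to metadata plus an object key, so table scans stop paying for megabytes of raw text per page.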
```diff
@@ -280,5 +276,5 @@ on-demand but in the current unfinished state, it's not super useful.
 [3]: <https://nlp.stanford.edu/IR-book/html/htmledition/dns-resolution-1.html> "IR DNS Resolver"
 [4]: <https://www.treehugger.com/remora-fish-suckers-sea-inspiring-new-adhesives-4858201> "Remora Fish, Those Suckers of the Sea, Are Inspiring New Adhesives"
 [5]: <https://dl.acm.org/doi/10.1145/1242572.1242592> "Detecting Near-Duplicates for Web Crawling"
```