Conference papers

It's a Tree... It's a Graph... It's a Traph!: Designing an on-file multi-level graph index for the Hyphe web crawler

Abstract : Hyphe, a web crawler for social scientists developed by the SciencesPo médialab, introduced the novel concept of web entities to provide a flexible and evolving way of grouping web pages in situations where the notion of website is not relevant enough (either too broad, for instance for Twitter accounts, newspaper articles or Wikipedia pages, or too narrow to group together multiple domains or TLDs). This comes with technical challenges, since indexing a graph of linked web entities as a dynamic layer over a large number of URLs is not as straightforward as it may seem. We aim to give the graph community some feedback on the design of an on-file index, part Graph, part Trie, named the "Traph", built to solve this peculiar use case. Additionally, we retrace the path we followed, from an old Lucene index, to our experiments with Neo4j, and lastly to our conclusion that we needed to develop our own data structure in order to scale up.
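
To make the "part Graph, part Trie" idea concrete, here is a minimal in-memory sketch in Python. It is not the Traph itself, which the abstract describes as an on-file index. It only illustrates one plausible reading: URLs are split into stems stored in a trie, some trie nodes are flagged as web entity prefixes, each page belongs to the entity with the longest matching prefix, and page-level links are aggregated into an entity-level graph. All names here (`url_to_stems`, `WebEntityTrie`, `entity_graph`) and the stem format are assumptions made for illustration.

```python
from urllib.parse import urlsplit


def url_to_stems(url):
    """Split a URL into ordered stems, most generic first (assumed format)."""
    parts = urlsplit(url)
    stems = ["s:" + parts.scheme]
    # Reverse the host so the TLD comes first: en.wikipedia.org -> org, wikipedia, en
    stems += ["h:" + label for label in reversed(parts.hostname.split("."))]
    stems += ["p:" + segment for segment in parts.path.split("/") if segment]
    return stems


class TrieNode:
    def __init__(self):
        self.children = {}
        self.entity = None  # name of the web entity whose prefix ends here, if any


class WebEntityTrie:
    def __init__(self):
        self.root = TrieNode()

    def declare_entity(self, name, prefix_url):
        """Flag the trie node reached by prefix_url as the start of a web entity."""
        node = self.root
        for stem in url_to_stems(prefix_url):
            node = node.children.setdefault(stem, TrieNode())
        node.entity = name

    def entity_of(self, url):
        """Return the entity with the longest declared prefix matching url, or None."""
        node, found = self.root, None
        for stem in url_to_stems(url):
            node = node.children.get(stem)
            if node is None:
                break
            if node.entity is not None:
                found = node.entity
        return found


def entity_graph(trie, page_links):
    """Aggregate (source_url, target_url) page links into entity-level edges."""
    edges = set()
    for source, target in page_links:
        a, b = trie.entity_of(source), trie.entity_of(target)
        if a and b and a != b:
            edges.add((a, b))
    return edges


trie = WebEntityTrie()
trie.declare_entity("twitter", "https://twitter.com")
trie.declare_entity("medialab-twitter", "https://twitter.com/medialab")
trie.declare_entity("wikipedia", "https://en.wikipedia.org")

links = [
    ("https://twitter.com/medialab/status/1", "https://en.wikipedia.org/wiki/Trie"),
    ("https://twitter.com/medialab/a", "https://twitter.com/medialab/b"),
]
print(entity_graph(trie, links))
```

In this sketch, entity membership is resolved at query time from the flags in the trie, so declaring a new, more specific prefix changes the grouping of existing pages without rewriting them. That is one way to read the "dynamic layer" mentioned in the abstract, under the assumptions stated above.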

https://hal-sciencespo.archives-ouvertes.fr/hal-03619528
Contributor : Spire Sciences Po Institutional Repository
Submitted on : Friday, March 25, 2022 - 10:45:45 AM
Last modification on : Friday, May 20, 2022 - 3:56:02 PM

Citation

Guillaume Plique, Mathieu Jacomy, Benjamin Ooghe, Paul Girard. It's a Tree... It's a Graph... It's a Traph!: Designing an on-file multi-level graph index for the Hyphe web crawler. FOSDEM'18, Feb 2018, Bruxelles, Belgium. ⟨hal-03619528⟩
