indexr needs to be able to classify papers "on the spot"
While model training can be done offline once a week, we need to be able to handle papers which aren't currently in arXiv_dataset.parquet. An example use case would be:
- User submits arXiv URL to frontend, frontend sends HTTP GET request to backend
- FastAPI backend receives the request and calls topic_search()
- The else case triggers, but instead of returning "this paper hasn't been classified"...
  - request the paper metadata from the arXiv API using the arXiv python package
  - create a TF-IDF vector of the abstract using a saved Vectorizer (load it from the pickles/ folder)
  - load the topic model and then do something like new_paper_topic_vector = model.transform(new_paper_tfidf_vector)
  - append the metadata to the end of meta_df
  - append the new topic vector to the end of doc_topic_mat
- Proceed with the rest of the process now that the paper has been added to the dataframe and the doc-topic matrix.
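
The steps in the else case could be sketched roughly as follows. This is only an illustration under several assumptions, not the actual indexr code: the helper names (arxiv_id_from_url, fetch_metadata, load_pickle, add_paper) and the id/title/abstract columns are hypothetical, doc_topic_mat is assumed to be a dense NumPy array, and the vectorizer and topic model are assumed to be pickled objects exposing a scikit-learn style transform(). The fetch step assumes a recent version of the arxiv package that provides arxiv.Client.

```python
import pickle
import re

import numpy as np
import pandas as pd


def arxiv_id_from_url(url: str) -> str:
    """Extract a new-style arXiv ID (e.g. 2101.00001) from an abs/pdf URL."""
    match = re.search(r"arxiv\.org/(?:abs|pdf)/(\d{4}\.\d{4,5})(?:v\d+)?", url)
    if match is None:
        raise ValueError(f"not a recognised arXiv URL: {url}")
    return match.group(1)


def fetch_metadata(arxiv_id: str) -> dict:
    """Fetch title and abstract through the arxiv package (network call)."""
    import arxiv  # imported lazily so the rest of the sketch runs without it

    search = arxiv.Search(id_list=[arxiv_id])
    result = next(arxiv.Client().results(search))
    # Hypothetical column names; the real meta_df schema may differ.
    return {"id": arxiv_id, "title": result.title, "abstract": result.summary}


def load_pickle(path):
    """Load a saved vectorizer or topic model (only unpickle trusted files)."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


def add_paper(meta, vectorizer, model, meta_df, doc_topic_mat):
    """Vectorize one abstract, infer its topics, and append it to both stores."""
    # transform() (not fit_transform) keeps the vocabulary fixed at training time.
    new_paper_tfidf_vector = vectorizer.transform([meta["abstract"]])
    new_paper_topic_vector = np.asarray(
        model.transform(new_paper_tfidf_vector)
    ).reshape(1, -1)
    # Append one row to the metadata frame and one row to the doc-topic matrix,
    # so row i of meta_df still corresponds to row i of doc_topic_mat.
    meta_df = pd.concat([meta_df, pd.DataFrame([meta])], ignore_index=True)
    doc_topic_mat = np.vstack([doc_topic_mat, new_paper_topic_vector])
    return meta_df, doc_topic_mat
```

Inside topic_search(), the else branch would then call arxiv_id_from_url() and fetch_metadata() on the submitted URL, load the vectorizer and model from pickles/ with load_pickle(), and continue with the meta_df and doc_topic_mat returned by add_paper(). Both are returned as new objects rather than modified in place, so the caller has to rebind them (and decide whether to persist them) itself.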
Edited by Derek Rodriguez