Hello everyone,
I am looking for a simple, reliable way to feed scientific articles and research data into an AI system (RAG).
At first, I tried writing my own scraping script to collect scientific literature, but it was a real headache: dealing with IP blocking and CAPTCHAs and cleaning up the HTML took too much time.
To avoid managing proxies and fragile scraping code, I think it would be simpler to use a dedicated service. From what I've seen, a solution like ScholarAPI lets you retrieve PDF files and clean metadata in JSON format directly, with no scraping or HTML cleanup on my side.
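To make concrete what I mean by "metadata as JSON from an API instead of scraping", here is a minimal Python sketch. Please note the assumptions: this does not show ScholarAPI's interface. It uses Crossref's public REST API (`https://api.crossref.org/works`) as a stand-in, the helper names `build_url`, `parse_items` and `search` are just illustrative, and it only covers metadata, not PDF download.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Stand-in endpoint (assumption): Crossref's public works search, not ScholarAPI.
CROSSREF_WORKS = "https://api.crossref.org/works"


def build_url(query, rows=5):
    """Build a search URL with a free-text query and a result limit."""
    return CROSSREF_WORKS + "?" + urlencode({"query": query, "rows": rows})


def parse_items(payload):
    """Flatten a Crossref-style response into small records for a RAG index."""
    records = []
    for item in payload.get("message", {}).get("items", []):
        titles = item.get("title") or [""]  # Crossref returns titles as a list
        authors = [
            " ".join(part for part in (a.get("given"), a.get("family")) if part)
            for a in item.get("author", [])
        ]
        records.append(
            {"doi": item.get("DOI"), "title": titles[0], "authors": authors}
        )
    return records


def search(query, rows=5):
    """Run the search over HTTP and return the parsed records."""
    with urlopen(build_url(query, rows), timeout=30) as resp:
        return parse_items(json.load(resp))
```

Calling `search("retrieval augmented generation")` performs the actual request; the parsing step is kept separate so it can be checked without network access.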
I would like to have your opinion:
- What do you think is the simplest method to retrieve this type of academic data without getting blocked?
- Do you use ready-made tools or do you prefer to code everything yourself?
Thank you for your advice!