Abstract
The Retrieval-Augmented Generation (RAG) model significantly enhances the capabilities of large language models (LLMs) by integrating information retrieval with text generation, which is particularly relevant for applications requiring context-aware responses grounded in dynamic data sources. This study presents a practical implementation of a RAG model tailored to a chatbot that answers user inquiries drawn from specific websites. The methodology encompasses several key steps: web scraping with BeautifulSoup to extract relevant content, text processing to segment that content into manageable chunks, and vectorization to create embeddings for efficient semantic search. Using semantic search, the system retrieves the document segments most relevant to a user's query, and the OpenAI API then generates contextually appropriate responses from the retrieved information. Key results highlight the system's effectiveness in providing accurate and relevant answers, with evaluation metrics centered on response quality, retrieval efficiency, and user satisfaction. This work contributes a comprehensive integration of scraping, vectorization, and semantic search technologies into a cohesive chatbot application, offering practical insights into the implementation of RAG models.
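The retrieval stage described above (chunking scraped text, embedding it, and ranking chunks by semantic similarity to a query) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the bag-of-words vectors stand in for real embeddings (a production system would call an embedding API such as OpenAI's), and all function names are illustrative.

```python
import math
import re
from collections import Counter

def chunk_text(text, max_words=50):
    """Split scraped page text into fixed-size, word-bounded chunks."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def embed(text):
    """Toy bag-of-words 'embedding'; a real pipeline would replace this
    with calls to an embedding model and store vectors in a vector DB."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, chunks, top_k=2):
    """Return the top_k chunks most semantically similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:top_k]

# The retrieved chunks would then be inserted into the prompt sent to a
# chat-completion endpoint so the generated answer is grounded in them.
```

In the full system, the same pattern holds: the ranking step is unchanged, only the embedding function and the chunk store are swapped for production components.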