UMP

About the University

Located at the heart of the future Green City of Benguerir, Mohammed VI Polytechnic University (UM6P), a higher education institution with an international standard, is established to serve Morocco and the African continent. Its vision is honed around research and innovation at the service of education and development. This unique nascent university, with its state-of-the-art campus and infrastructure, has woven a sound academic and research network, and its recruitment process is seeking high-quality academics and professionals to boost its quality-oriented research environment in the metropolitan area of Marrakech.

About the College

The College of Computing (UM6P-CC) is located in the future Green City of Benguerir. It provides world-class university education in computer science, promoting discovery and innovation. The College currently offers an engineering degree in computer engineering, two executive master's programs in machine learning and cybersecurity, and a doctoral program in computer science.

Description

Most machine learning (ML) models, including Large Language Models (LLMs), are designed to memorize information and exploit it for reasoning (i.e., use logic to draw a conclusion or make a decision). The more parameters a model has, the more information it can encode and the higher its reasoning ability. However, despite their impressive capabilities across diverse tasks and domains, these models often struggle with knowledge-intensive tasks in real use cases, because they are limited to the information learned during training. To alleviate this problem, the model can be fine-tuned on datasets relevant to the real use case. However, this solution is static, since the model needs retraining whenever the datasets change, and costly if the model has a very large number of parameters. An alternative solution is to use Retrieval-Augmented Generation (RAG) [1,2], where a language model is augmented with a retriever that can access data from external sources (e.g., confidential company reports). This data is typically represented with embeddings and stored in a vector database. RAG has been demonstrated to reduce the hallucinations of modern LLMs [3].
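As a rough illustration of the retrieve-then-augment idea described above, here is a minimal Python sketch. It is not the framework targeted by this thesis: the bag-of-words `embed` function stands in for a learned embedding model, the in-memory `store` list stands in for a vector database, and all names (`embed`, `cosine`, `retrieve`, `build_prompt`) and the sample documents are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: word counts (assumed stand-in for a learned encoder)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, store, k=2):
    """Return the k documents whose embeddings are most similar to the query."""
    q = embed(query)
    ranked = sorted(store, key=lambda item: cosine(q, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]


def build_prompt(query, passages):
    """Prepend the retrieved passages to the question before calling a model."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Context:\n{context}\n\nQuestion: {query}"


# Hypothetical external documents, embedded once and kept in memory.
documents = [
    "The quarterly report shows revenue grew in the third quarter.",
    "The cafeteria menu changes every Monday.",
    "Revenue forecasts for the next quarter remain stable.",
]
store = [(doc, embed(doc)) for doc in documents]

query = "How did revenue change in the third quarter?"
top = retrieve(query, store, k=2)
prompt = build_prompt(query, top)
print(prompt)
```

The exhaustive scan in `retrieve` compares the query against every stored vector, which is only practical for small collections; replacing it with index structures that scale to very large collections is the kind of problem this thesis addresses.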

The goal of this thesis is to propose a cost-effective scalable retrieval framework that decouples reasoning from memory, where a lightweight machine learning model is enhanced with a robust and versatile retriever that can extract and rank relevant knowledge encoded in external sources accurately and efficiently. First, an experimental evaluation will be conducted to compare the current retrieval systems and identify their strengths and weaknesses. Second, new retrieval data structures and algorithms will be proposed to enhance lightweight open-source machine learning models. Third, the accuracy and efficiency of the proposed retrieval framework will be evaluated on real use cases.

This thesis will be supervised by Prof. Karima Echihabi from the College of Computing at Mohammed VI Polytechnic University. The selected student will join a growing and supportive team with internationally recognized expertise in data management, machine learning, and deep learning [4, 5]. The successful candidate will have the opportunity to collaborate on projects with industrial partners.

Challenges

The challenges lie in designing and developing novel data structures and algorithms that can scale to billions of deep network embeddings and showcasing their effectiveness in real use cases.

Prerequisites

A master's degree in Computer Science or a closely related field.
An excellent academic record.
Strong aptitude in mathematics, data structures, and algorithms.
Knowledge of relational and non-relational databases.
Excellent programming skills in C/C++/Python.
Research project experience and publications are a plus.

How to Apply

In addition to applying through the platform, email your CV to Prof. Karima Echihabi: karima.echihabi@um6p.ma with the title [PhD-End2EndRAG]. The CV should indicate the class rank during university studies, the type of high-school degree obtained with the overall average, and the score on the national mathematics examination.

References

[1] Chen, J., Lin, H., Han, X. and Sun, L., 2023. Benchmarking Large Language Models in Retrieval-Augmented Generation. arXiv preprint arXiv:2309.01431.

[2] He, H., Zhang, H. and Roth, D., 2022. Rethinking with retrieval: Faithful large language model inference. arXiv preprint arXiv:2301.00303.

[3] Borgeaud, S., Mensch, A., Hoffmann, J., Cai, T., Rutherford, E., Millican, K., … & Sifre, L., 2022. Improving language models by retrieving from trillions of tokens. In International Conference on Machine Learning (pp. 2206-2240). PMLR.

[4] K. Echihabi, P. Fatourou, K. Zoumpatianos, T. Palpanas, and H. Benbrahim, “Hercules Against Data Series Similarity Search,” in Proceedings of the VLDB Endowment, 2022, vol. 15, no. 10, pp. 2005-2018.

[5] I. Azizi, K. Echihabi, and T. Palpanas, “ELPIS: Graph-Based Similarity Search for Scalable Data Science,” in Proceedings of the VLDB Endowment, 2023, vol. 16, no. 6, pp. 1548-1559.

CEDoc-UM6P-CC: Scalable Retrieval-Enhanced Machine Learning Full-time
