National AI Student Challenge 2024 (Track 1)
The National AI Student Challenge is a competition organised by AISG for students to showcase their skills in AI. For 2024, there were 3 tracks. Track 1, by the AI Professionals Association, required participants to create a model to predict the adoption rates of pets listed on PetFinder.my, while the other 2 tracks involved developing or tuning an existing large language model. Due to the eligibility criteria (NSF), I was only able to join the competition under Track 1. The challenge consisted of 3 phases. The first phase was the Technical Ability Test, in which participants created a model based on the problem statement. The second phase was the interview, where participants presented their findings and models to the judges. Then, if they were lucky enough, participants competed in the final judging round. I managed to pass all three rounds to clinch 1st prize.
This page is just a summary of my project. For more info, please have a look at the repo.
Before I could create my model, I had to conduct Exploratory Data Analysis (EDA) and present the data to the judges. There were roughly 15 thousand pets (or data points, in this case) which had to be cleaned up for mistakes like typos and duplicates. In addition, I inferred additional data from the description that came with each pet listing, using TextBlob for sentiment analysis and Polyglot to detect the language. After cleaning the data, I used ydata-profiling to generate a simple report. However, most of the analysis is contained in the notebook eda.ipynb, where I used Seaborn and Pandas to produce graphs and to ultimately draw the judges' attention to important correlations between certain variables and the time it took for a pet to go from being listed to being adopted. These variables include the pet's age, health and number of photos in the listing.
With a clearer picture of the data, I proceeded to create my prediction models. I opted to make models using both scikit-learn and TensorFlow as I wanted to try both (to be fair, I had loads of time after A-levels). In the end, I had 6 models: for each of scikit-learn and TensorFlow, I created a binary classifier (adopted or not), a speed classifier (how fast a pet was adopted) and a regressor (how fast a pet was adopted, but as a scale). Due to the low correlation between most variables and adoption speed, the models could not learn many patterns from the data, and thus accuracy was relatively low.
For this challenge, it was not sufficient to just have a working prediction model; ethics also seemed to play a big part in the overall judging. From the problem statement:

> Please note that the development and use of such a prediction model might involve several ethical and safety concerns that may need to be carefully considered so that the system may be trusted.
Hence, I placed a greater emphasis on the ethics of my models. For example, I suggested that the system could highlight disadvantaged pets that my models predict would not be adopted, and recommend them to users.
While I was very happy with the outcome of the competition, I felt that there could be some improvements to my project. For example, when trying to infer data from the description of each pet listing, I could employ LLMs to pull keywords, the presence or absence of which could be used to train the model. I could even find the actual listing for each pet on PetFinder.my, pull the images and videos included in the listing, and train a deep learning model on them to predict adoption outcomes. Moreover, we only got data about the pets. If we had information on the users, we could get an idea of what kinds of users are more likely to adopt what kinds of pets, making recommendations more personal.
Links
- Code Repo
- Lianhe Zaobao Article (Scroll Down for Unglams), via Wayback Machine
- Challenge Website
- Dataset
- Interview Presentation