An eight-week project to identify companies potentially involved in producing materials used in the nuclear fuel cycle. Our team worked with many new technologies: PySpark, cluster computing on Amazon Web Services (EMR, SageMaker, and S3), and Parquet. We also monitored billing for the AWS services.
Analysed and cleaned a large (175 GB) data set.
Expanded the positive class via anomaly detection.
Developed models using NLP.