Projects are posted to help find interested individuals with the appropriate expertise to meet project needs

Unsupervised learning on a large clinical corpus for biomedical knowledge discovery


Project Description

Representation learning is a powerful machine learning technique to capture the intrinsic structure of data and meaningful interdependencies. It is growing in popularity in a wide variety of domains, including computer vision and natural language processing.

The availability of the clinical notes corpus at UCSF (~75M) represents an opportunity to apply these unsupervised methods (e.g., concept embeddings, autoencoders) and obtain a representation that captures important semantics of medical practice. We propose to apply these methods and to develop a framework to test, evaluate, and compare different models. For example, does the representation properly capture the similarity between brand-name drugs and their generic counterparts? Does it capture ontological information (e.g., that ulcerative colitis is a subtype of IBD)?
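One way the brand-name/generic question could be turned into a measurable test is sketched below. It is only an illustration, not the project's actual framework: the function names (`cosine`, `nearest_neighbors`, `pair_hit_rate`) are hypothetical, and the three-dimensional vectors are hand-made placeholders standing in for embeddings that a real model would learn from the corpus. The metric is the fraction of brand names whose generic counterpart appears among their k nearest neighbors by cosine similarity.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_neighbors(embeddings, query, k):
    """Return the k terms most similar to `query`, excluding the query itself."""
    scored = [
        (cosine(embeddings[query], vec), term)
        for term, vec in embeddings.items()
        if term != query
    ]
    scored.sort(reverse=True)
    return [term for _, term in scored[:k]]


def pair_hit_rate(embeddings, pairs, k=1):
    """Fraction of (brand, generic) pairs where the generic is in the brand's top k."""
    hits = sum(
        1 for brand, generic in pairs
        if generic in nearest_neighbors(embeddings, brand, k)
    )
    return hits / len(pairs)


# Placeholder vectors invented for illustration; NOT learned from any corpus.
TOY_EMBEDDINGS = {
    "humira": [0.9, 0.1, 0.0],
    "adalimumab": [0.8, 0.2, 0.1],
    "remicade": [0.1, 0.9, 0.0],
    "infliximab": [0.2, 0.8, 0.1],
    "tylenol": [0.0, 0.1, 0.9],
    "acetaminophen": [0.1, 0.0, 0.8],
}

# Brand-name / generic pairs used as the evaluation set.
BRAND_GENERIC_PAIRS = [
    ("humira", "adalimumab"),
    ("remicade", "infliximab"),
    ("tylenol", "acetaminophen"),
]

hit_rate = pair_hit_rate(TOY_EMBEDDINGS, BRAND_GENERIC_PAIRS, k=1)
print(f"hit rate at k=1: {hit_rate:.2f}")
```

Because the same function accepts any mapping from term to vector, different models could be compared by running it on each model's embeddings with an identical pair list. An analogous test for ontological relations (e.g., subtype pairs) would need its own metric and is not covered by this sketch.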

If successful, such a representation could form the basis of a future platform for biomedical knowledge discovery. For example, querying it for drugs that improve IBD should (hopefully) reveal the drugs currently used to treat the disease, and may also identify drugs not previously known to reduce disease activity (i.e., candidates for drug repurposing).
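A query of this kind might look roughly like the following sketch, which ranks drug vectors by cosine similarity to a disease vector and separates the top results into known treatments and unlisted candidates. Everything here is an assumption made for illustration: the vectors are invented placeholders, `drug_x` and `drug_y` are hypothetical names, and `rank_candidates` is not an existing API. Similarity in an embedding space does not by itself show that a drug improves a disease, so any unlisted candidate would only be a hypothesis for further study.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_candidates(disease_vec, drug_vecs):
    """Return drug names ordered from most to least similar to the disease vector."""
    return sorted(
        drug_vecs,
        key=lambda name: cosine(disease_vec, drug_vecs[name]),
        reverse=True,
    )


# Placeholder vectors invented for illustration; NOT learned from any corpus.
IBD_VECTOR = [0.9, 0.3, 0.1]
DRUG_VECTORS = {
    "infliximab": [0.8, 0.4, 0.1],
    "mesalamine": [0.9, 0.2, 0.2],
    "drug_x": [0.7, 0.5, 0.0],  # hypothetical drug, not a real compound
    "drug_y": [0.0, 0.1, 0.9],  # hypothetical drug, not a real compound
}

# Drugs already used for the disease, used to sanity-check the ranking.
KNOWN_TREATMENTS = {"infliximab", "mesalamine"}

ranked = rank_candidates(IBD_VECTOR, DRUG_VECTORS)
top = ranked[:3]
known_hits = [name for name in top if name in KNOWN_TREATMENTS]
novel_candidates = [name for name in top if name not in KNOWN_TREATMENTS]

print("known treatments recovered:", known_hits)
print("unlisted candidates to review:", novel_candidates)
```

Recovering the known treatments near the top serves as a sanity check on the representation, while the unlisted entries are the ones a repurposing analysis would examine more closely.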


Required Skills
  • Python
  • HPC/GPU computing
  • Natural Language Processing tools (NLTK, cTAKES, PyTorch-NLP, etc.)
  • Deep Learning
Required Course Work or Level of Knowledge
  • Machine Learning, intermediate - advanced
  • Individuals with a data science background, ideally with deep learning experience, would be most helpful for this project.
Acceptable Level of Education (e.g., Undergrad, Grad Students, Post Docs, MD, PhD)

Undergraduate students, Graduate students, Postgraduates, Full-time workers who volunteer time




PI/Research group

Atul Butte


Vivek Rudrapatna, [email protected]