Projects are posted to help find interested individuals with the appropriate expertise to carry them out.

Feature selection on clinical text data to predict the onset of IBD


Project Description

Inflammatory bowel disease (IBD) is a chronic disease whose incidence is rising globally and whose cause remains unknown. To make progress towards understanding its cause and developing preventative strategies, we must find ways to identify and study at-risk individuals before they develop the disease.

Many ongoing studies are using genomic and proteomic biomarkers to predict IBD onset. However, no studies to date have assessed the potential of ‘digital biomarkers’ derived from electronic health record data, especially free-text clinical notes. We may find that IBD can be predicted from distinctive textual or other features captured during routine clinical visits before diagnosis. We propose to apply feature selection and dimensionality reduction to these data to help identify these patients.
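
As a rough illustration of what feature selection on free-text notes can look like, the sketch below scores each term by the chi-squared statistic of its presence against a binary label and keeps the top-scoring terms. It is not the project's actual method. The four notes and their labels are invented toy data, and the function names (tokenize, chi2_scores, select_top_k) are hypothetical. Dimensionality reduction is not shown.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase a note and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def chi2_scores(notes, labels):
    """Score each term by the chi-squared statistic of its 2x2
    presence/absence-by-label contingency table."""
    n = len(notes)
    n_pos = sum(labels)
    n_neg = n - n_pos
    pos_df, neg_df = Counter(), Counter()
    for text, y in zip(notes, labels):
        terms = set(tokenize(text))  # count each term once per note
        (pos_df if y else neg_df).update(terms)
    scores = {}
    for term in set(pos_df) | set(neg_df):
        a = pos_df[term]   # label-1 notes containing the term
        b = neg_df[term]   # label-0 notes containing the term
        c = n_pos - a      # label-1 notes without the term
        d = n_neg - b      # label-0 notes without the term
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        scores[term] = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores

def select_top_k(scores, k):
    """Return the k highest-scoring terms, ties broken alphabetically."""
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]

# Invented toy notes; label 1 = later diagnosed (assumed), 0 = not.
notes = [
    "recurrent abdominal pain and diarrhea",
    "diarrhea with weight loss and fatigue",
    "ankle sprain after a fall",
    "seasonal allergies and cough",
]
labels = [1, 1, 0, 0]

scores = chi2_scores(notes, labels)
top_terms = select_top_k(scores, 3)
print(top_terms)
```

On real chart data the same idea would more likely be run through an established library, with clinical concept extraction done before scoring. The hand-rolled version is only meant to make the scoring step explicit.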

If you are clinically oriented, you’ll help me curate charts to find patients who have pre-diagnosis data. If you are an ML/NLP superstar, you’ll help drive the data science and develop a ‘digital biomarker’.




Required Skills
  • Python
  • HPC/GPU computing
  • Natural Language Processing tools (NLTK, cTAKES, PyTorch-NLP, etc.)
Required Course Work or Level of Knowledge
  • Machine Learning: intermediate to advanced
  • Ideally, candidates would fall into one of the following categories.
    1. Experience with machine learning and natural language processing. Deep learning experience would be a plus but not critical.
    2. A medical background (to assist in study design and manual data abstraction for model training).
Acceptable Level of Education (e.g., Undergrad, Grad Students, Post Docs, MD, PhD)

Undergraduate students, Graduate students, Postgraduates, Full-time workers who volunteer time

Additional Considerations

I will gladly accept any help or interest. As above, if you are experienced in ML/NLP, you'd help me build the model; if not, you'd help me read and annotate the notes used to train it.


PI/Research group

Butte Lab


Vivek Rudrapatna, [email protected]