Review Article | Open Access
Anusha Bompelli, Yanshan Wang, Ruyuan Wan, Esha Singh, Yuqi Zhou, Lin Xu, David Oniani, Bhavani Singh Agnikula Kshatriya, Joyce (Joy) E. Balls-Berry, Rui Zhang, "Social and Behavioral Determinants of Health in the Era of Artificial Intelligence with Electronic Health Records: A Scoping Review", Health Data Science, vol. 2021, Article ID 9759016, 19 pages, 2021. https://doi.org/10.34133/2021/9759016
Social and Behavioral Determinants of Health in the Era of Artificial Intelligence with Electronic Health Records: A Scoping Review
Background. There is growing evidence that social and behavioral determinants of health (SBDH) have a substantial effect on a wide range of health outcomes. Electronic health records (EHRs) have been widely employed to conduct observational studies in the age of artificial intelligence (AI). However, there has been limited review of how to make the most of SBDH information from EHRs using AI approaches. Methods. A systematic search was conducted in six databases to find relevant, recently published peer-reviewed publications. Relevance was determined by screening and evaluating the articles. Based on selected relevant studies, a methodological analysis of AI algorithms leveraging SBDH information in EHR data was provided. Results. Our synthesis was driven by an analysis of SBDH categories, the relationship between SBDH and healthcare-related statuses, natural language processing (NLP) approaches for extracting SBDH from clinical notes, and predictive models using SBDH for health outcomes. Discussion. The associations between SBDH and health outcomes are complicated and diverse; several pathways may be involved. Using NLP technology to support the extraction of SBDH and other clinical concepts simplifies the identification and extraction of essential concepts from clinical data, efficiently unlocks unstructured data, and aids in the resolution of unstructured data-related issues. Conclusion. Despite known associations between SBDH and diseases, SBDH factors are rarely investigated as interventions to improve patient outcomes. Gaining knowledge about SBDH and how SBDH data can be collected from EHRs using NLP approaches and predictive models improves the chances of influencing health policy change for patient wellness, ultimately promoting health and health equity.
Social and behavioral determinants of health (SBDH, often called SDOH) are conditions in the environments in which people are born, live, learn, work, play, worship, and age that affect a wide range of health, functioning, and quality of life outcomes and risks. The SBDH are grouped into five key categories: economic stability; education access and quality; social and community context; neighborhood and built environment; and healthcare access and quality. As our population becomes more diverse, there is growing evidence of the significant impact of SBDH on various healthcare outcomes such as mortality [2, 3], morbidity, life expectancy, healthcare expenditures, health status, and functional limitations. For example, a study showed that SBDH factors, including education, racial inequality, social support, and poverty, accounted for more than a third of the estimated annual deaths in the United States [6, 7].
Addressing SBDH is necessary not only to enhance public health but also to eliminate health inequalities that are often entrenched in social and economic inequalities. One way to do this is by integrating SBDH into the electronic health record (EHR). Increased use of EHR systems in healthcare organizations has facilitated secondary use of EHR data through artificial intelligence (AI) techniques to improve patient care outcomes via clinical decision support systems, chronic disease management, and patient education. Most recently, AI methods were used to propose candidate drugs for COVID-19.
SBDH information in the EHR is stored in both structured (e.g., education and salary level) and unstructured formats (e.g., social history in clinical notes). Since there is no standardized framework for recording SBDH information and such information is usually incompletely recorded in a structured format, it is often difficult to identify SBDH present in an unstructured format and to establish a connection between SBDH and disease or health outcomes. Approaches that leverage natural language processing (NLP) tools to extract SBDH information stored in an unstructured format in EHRs are still limited. Prior literature reviews on SBDH mainly focused on the integration of SBDH into EHRs and the impact of SBDH in risk prediction, the role of SBDH in mental health, and the availability and characteristics of SBDH in EHRs. None of these reviews discussed the AI methods using SBDH in EHR data. This paper provides a scoping review of the SBDH factors, the relationship between SBDH and diseases, the NLP techniques used to extract SBDH information from clinical notes, and predictive models using SBDH factors to predict health outcomes.
2.1. Data Sources and Search Strategies
This scoping review followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines. A comprehensive search of several databases from 2010 to October 15th, 2020 (English language), was conducted. The databases included Ovid MEDLINE(R) and Epub Ahead of Print, In-Process & Other Non-Indexed Citations, and Daily, Ovid EMBASE, Scopus, Web of Science, the ACM Digital Library, and IEEE Xplore. The editorial, erratum, letter, note, and comment article types were excluded. The search patterns used in these databases were consistent. We iteratively updated our searching keywords by searching for relevant articles and identifying specific SBDH-related keywords from those articles and repeating the process. The final keywords used in the search query are shown in Table 1, and the query was implemented by an experienced librarian. A detailed description of the search strategies used is provided in Supplemental Document 1.
2.2. Article Selection
After potential articles were acquired, the next steps were abstract screening and full-text screening. Article exclusion criteria included the following: duplicates; conference abstracts; unavailable full text; and articles unrelated to SBDH, AI, or EHR. In the first round, two reviewers used the eligibility criteria to screen the articles by title and abstract. In the second round, seven reviewers reviewed the full text of the papers for relevance, with the workload evenly distributed among them. The studies selected in the second round were reviewed further by two reviewers. All conflicts that occurred during the screening rounds were discussed until consensus was reached.
2.3. Data Extraction from Included Studies
Two reviewers (RZ and YW) developed and specified the data elements to be retrieved, while the remaining seven reviewers extracted the data from the full-text articles. The following data items were captured in Supplementary Table 1, titled “Information extracted from the included studies”: citation, country, SBDH data source, clinical note type, disease, disease category (ICD 10), SBDH focus, SBDH category, SBDH level, SBDH role, study cohort (patients), study cohort (clinical sites), and AI methods.
2.4. Data Synthesis and Analysis
SBDH research is an interdisciplinary field that combines healthcare, social science, and informatics. Based on the articles identified, our data synthesis was motivated by an approach to gain insight into how SBDH affects disease risk or onset prediction using AI and review how effective current NLP systems are at extracting various types of SBDH. We began by examining the general characteristics of the 79 included studies, such as publication trend and journal venues. We examined the coverage of SBDH and disease categories and illustrated their relationships to provide insight into how SBDH characteristics influence healthcare-related statuses. Furthermore, we investigated NLP approaches for extracting SBDH and predictive models for predicting outcomes using SBDH characteristics. Supplementary Table 1 contains further information about the included articles and the review analysis. Regarding software, Zotero was utilized for citation management, Microsoft Excel for data extraction and collection, and RAWGraphs for data analysis.
3.1. Identification of Included Studies
A total of 1433 articles were retrieved from the six databases, of which 643 articles were found to be unique. The articles were then filtered manually based on the title and abstract to check whether they were related to SBDH and AI based on EHR data. A total of 283 articles remained for subsequent full-text review. The inclusion criteria for the target publications were as follows: (1) AI techniques are used, (2) the SBDH information is from the EHR data, and (3) EHR data is in English. Articles without full text were excluded. After this full-text screening process, 79 articles were selected for this scoping review. A flow chart of the article selection process is shown in Figure 1.
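The attrition between screening stages can be checked with a few lines of arithmetic. The following is a minimal sketch that uses only the counts reported above; the stage labels and the function name `attrition` are illustrative assumptions, not part of the review protocol.

```python
# Stage counts as reported in the text; the labels are illustrative.
STAGES = [
    ("retrieved from databases", 1433),
    ("unique after de-duplication", 643),
    ("kept after title/abstract screening", 283),
    ("included after full-text screening", 79),
]


def attrition(stages):
    """Return (label, remaining, removed_since_previous_stage) tuples."""
    rows = []
    previous = None
    for label, count in stages:
        removed = 0 if previous is None else previous - count
        rows.append((label, count, removed))
        previous = count
    return rows


flow = attrition(STAGES)
for label, count, removed in flow:
    print(f"{label:38s} n={count:5d}  removed={removed}")
```

Under these counts, 790 records drop out as duplicates, 360 at title and abstract screening, and 204 at full-text screening.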
3.2. General Characteristics of the Included Studies
The general characteristics of the included studies are summarized in Table 2. All the included studies were published between 2012 and 2021 (Figure 2), with an increasing number of publications from 2012 to 2020. The current review includes publications from 11 countries (Figure 3), with most of the contributions from the United States (77%). Investigation of the publication venues indicates which research communities utilize SBDH from EHRs by leveraging informatics techniques. The types of venues for conferences were determined by manually researching conference information. The venues for journals were determined through Scimago and Clarivate Analytics. The studies fall into four general types: (1) clinical, (2) informatics, (3) social science, and (4) multidisciplinary.
SBDH: social and behavioral determinants of health; NLP: natural language processing.
EHRs were used in all studies in the review. Overall, 40.5% of the reviewed studies used structured data, 29.1% used unstructured data, and 30.4% used both structured and unstructured data. Other data came from claims data [13–17], NHANES, and clinical trials. Individual SBDH data were used in 62.0% of studies, neighborhood SBDH data were used in 8.9% of studies, and both individual and neighborhood data on SBDH were used in 29.1% of studies. The largest share of studies (45.6%) had fewer than 10,000 data samples, 24.1% of studies had between 10,000 and 100,000 samples, and 18.9% had more than 100,000 samples.
Among all included studies, the most common use (70.9%) of the SBDH information was as a predictor of health outcomes, followed by disease management (21.5%) and outcome of analysis (7.6%). The percentage of nonclinical studies (48%) is slightly lower than that of clinical studies (52%). NLP and predictive modeling techniques are the two main types of AI methods used to obtain and use SBDH information in the reviewed studies. Of the reviewed studies, 29% used only predictive modeling, 18% used only NLP, and 3% used only statistical analysis. The rest of the studies used either two of the methods mentioned above (35%) or all three methods (16%).
3.3. SBDH Type
There is no single standardized method for categorizing SBDH factors. For example, the WHO SBDH conceptual framework includes socioeconomic and political context, socioeconomic position, social cohesion and social capital, and health system. However, Healthy People 2030 categorized SBDH factors into economic stability, education access and quality, social and community context, neighborhood and built environment, and healthcare access and quality instead. Since the articles we reviewed had no information on the sociopolitical context, we have classified the SBDH factors according to the Healthy People 2030 framework. Several studies have attempted to study SBDH in order to determine the impact of social factors on health. The inclusion of SBDH in Table 3 was determined by coverage of SBDH in one or more publications. From the articles reviewed, the SBDH factors identified were categorized into one of the five SBDH categories. While identifying SBDH factors, a few SBDH mentions from the reviewed articles were standardized to a single SBDH factor to prevent excessive granularity. For example, SBDH mentions such as alcohol, tobacco, and drug abuse have been normalized to the substance use/abuse SBDH factor. Most of the studies focused on more than one SBDH factor belonging to different SBDH categories.
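The normalization step described above can be pictured as a lookup from raw mentions to a canonical factor. The sketch below is an illustration under assumptions: only the substance use/abuse grouping comes from the text, and the names `NORMALIZATION`, `normalize_mention`, and `count_factors` are hypothetical.

```python
from collections import Counter

# Only this grouping is stated in the text; a real mapping would be
# curated by the reviewers and would cover many more mentions.
NORMALIZATION = {
    "alcohol": "substance use/abuse",
    "tobacco": "substance use/abuse",
    "drug abuse": "substance use/abuse",
}


def normalize_mention(mention):
    """Map a raw SBDH mention to its canonical factor, if one is defined."""
    key = mention.strip().lower()
    return NORMALIZATION.get(key, key)


def count_factors(mentions):
    """Count canonical SBDH factors across a list of raw mentions."""
    return Counter(normalize_mention(m) for m in mentions)


counts = count_factors(["Alcohol", "tobacco", "drug abuse", "diet"])
print(counts)
```

Mentions without an entry in the map pass through unchanged (lowercased), so the coarse grouping applies only where it was explicitly decided.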
Overall, 29.5% of the studies reviewed focused on SBDH factors associated with healthcare access and quality, 24.7% focused on economic stability, 20% focused on social and community context, 16.3% focused on neighborhood and built environment, and 9.5% focused on education access and quality. Widely studied SBDH factors include substance use/abuse (9%) [14, 16, 20–35], education (7.3%) [16, 21, 36–46], employment status (6.3%) [16, 20, 21, 23, 31, 36, 38, 39, 45–48], socioeconomic status (6.3%) [29, 39, 44, 46, 49–56], lifestyle (5.8%) [22, 29, 32, 43, 57–63], socioeconomic factors (5.3%) [28, 38, 39, 46, 58, 63–67], diet (5.3%) [21, 22, 26, 43, 58, 68–72], housing status (5.3%) [16, 21, 33, 35, 38, 46, 73–76], social support (5.3%) [14, 15, 35, 74, 77–82], physical activity (4.8%) [20, 31, 40, 58, 68–70, 78, 83], marital status (3.7%) [31, 36, 40, 45, 55, 84, 85], housing instability (2.6%) [14, 16, 45, 75, 86], environmental factors (2.1%) [20, 52, 56, 87], insurance (2.1%) [20, 24, 38, 88], and homelessness (2.1%) [89–93]. Other significant SBDH factors include geographic location [24, 31, 44, 70], health literacy [43, 47, 88, 94], social patterns (sexual health, adverse experiences and behavioral attitudes) [33, 52, 73, 95], social environment [41, 56, 77], health access [21, 54, 88], living condition [20, 31, 35], social behavior [54, 63, 82], and financial insecurity [81, 86].
3.4. Relations between SBDH and Health Outcomes
3.4.1. Healthcare-Related Statuses
Among the 79 articles, 59 articles focused on SBDH factors associated with one or more disease conditions. From the 59 articles reviewed, the diseases investigated were grouped into one of the 12 categories (see Table 4). 25% of the studies focused on mental, behavioral, and neurodevelopmental disorders; 17% on endocrine, nutritional, and metabolic diseases; and 10% on diseases of the circulatory system. The most widely studied diseases include diabetes [45, 57, 59, 61, 66, 94], obesity [45, 55, 68, 72], geriatric syndrome [15, 58, 78, 80], and HIV [51, 82, 95]. Other notable mentions include hypertension [45, 60], stroke [45, 74], dementia [22, 40], and cancer [32, 43]. An interesting finding is that six articles [23, 27–29, 35, 83] studied SBDH factors related to outcome measures such as hospital readmission risk and all-cause nonelective readmission.
3.4.2. SBDH Categories
SBDH reflect social, physical, economic, and environmental influences that may or may not be under a person’s control but have a major effect on the wellbeing of the individual. The SBDH factors can act as either risk factors or intervention factors and can influence the burden of disease. There has been very little research on the association between SBDH characteristics and disease, as well as the prevalence of SBDH in any disease conditions. To understand the factors that influence disease and the relationships between them, we have tried to draw attention to the diseases listed in the articles and their respective SBDH factors. Figure 4 shows that the top 10 SBDH factors include education, substance use/abuse, socioeconomic status (SES), diet, lifestyle, social support, employment status, socioeconomic factors, marital status, and physical activity, whereas the top 5 diseases include obesity, geriatric syndrome, diabetes, HIV, and childhood obesity. The center part of Figure 4 shows the associations between SBDH factors and a variety of diseases.
(1) Healthcare Access and Quality. SBDH factors such as substance use/abuse, diet, lifestyle, and physical activity were mainly studied in endocrine, nutritional, and metabolic diseases (diabetes and obesity) and mental, behavioral, and neurodevelopmental disorders (Alzheimer’s disease (AD), dementia, mental disorders, delirium, and depression). Two studies demonstrated data integration through the semantic ETL service and the MOSAIC dashboard system and showed that using SBDH factors such as lifestyle, physical activity, and diet could improve the management of obesity and diabetes. Zhang et al. showed that educating patients with lifestyle interventions was associated with improved glycemic control in diabetes patients, whereas Zhou et al. analyzed published lifestyle exposures and related intervention strategies for AD patients.
(2) Economic Stability. The association between SBDH factors such as socioeconomic factors, employment status, and SES and mental, behavioral, and neurodevelopmental disorders (opioid misuse, attention deficit hyperactivity disorder (ADHD), substance use disorders (SUDs)) as well as injury, poisoning, and certain other consequences of external causes (suicide, postdeployment stress) was widely studied. Zhang-James et al. and Afshar et al. studied the impact of low socioeconomic status and socioeconomic distress in individuals with at-risk comorbid SUDs and opioid misuse, respectively. Zheng et al. developed an early-warning system to identify patients at risk of suicide attempts, while a few studies stated that predictors like SES and socioeconomic factors can be used to predict suicide risk.
(3) Education Access and Quality. The education factor was critically analyzed in the mental, behavioral, and neurodevelopmental disorder category (ADHD, bipolar disorder, dementia, mental disorders, opioid misuse, schizophrenia, and SUDs). Education level, in association with other factors like employment status and income, was found to correlate significantly with suicidal behavior in patients with mental illness. Senior et al. developed the OxMIS tool to predict suicide in patients with severe mental illness using SBDH factors such as highest education level and substance abuse.
(4) Social and Community Context. Although about 15 SBDH factors belong to this category, factors such as social support and marital status have been the most widely studied. Poor social support has been shown to have an impact on hospital readmission in patients with HIV or dementia. Biro et al. and Ge et al., respectively, studied the relationship between marital status and conditions such as obesity and suicidal ideation specific to major depressive disorder.
(5) Neighborhood and Built Environment. Housing status, including homelessness, and geographic location were the focus in the mental, behavioral, and neurodevelopmental disorder category (delirium, ADHD, opioid misuse, and SUDs) and in diseases of the circulatory system (congestive heart failure, acute myocardial infarction, and stroke). Davoudi et al. and Nau et al. studied how geographic location serves as a surrogate for socioeconomic characteristics of the neighborhood that have been shown to be associated with multiple diseases and health behaviors and with high mortality.
3.4.3. The Impact of SBDH on Disease
Among the individual diseases, obesity has been extensively studied, and the most researched factors include diet, marital status, education, employment status, housing instability, material and social deprivation, physical activity, sleep, SES, and socioenvironmental neighborhood. Biro et al. examined the association between obesity and material and social deprivation and found that patients in the most deprived group were 35% more likely to be obese than patients in the least deprived group. Nau et al. discovered 13 variables of the social, dietary, and physical activity environment that, when combined, correctly categorized 67 percent of communities as obesoprotective or obesogenic using mean BMI-z as a surrogate.
Social environment, education, financial insecurity, work-related challenges, lifestyle, and environmental factors were the primary SBDH factors associated with mental disorders. Kim et al. reported that patients who abused alcohol were 3.3 times more likely to commit suicide than those who did not. According to adjusted predictive models, chart-noted alcohol abuse had a stronger association with suicide mortality than administrative data-based alcohol abuse diagnoses. According to Davoudi et al., age, alcohol or drug misuse, and socioeconomic level can all increase the incidence of delirium. According to Wang et al., compared with the nondementia population, the dementia population had higher proportions of patients in the “single” and “widowed” marital status categories and in the “high school and equivalent” education category. Walsh et al. incorporated risk factors such as comorbidities, medication use, clinical encounter histories, socioeconomic status, and demographics and reported that machine learning models produced reliable predictions of nonfatal suicide attempts across several cohort comparisons and time frames. Grinspan et al. used bivariate analysis to identify multiple potential predictors of ED usage for epilepsy (demographics, social determinants of health, comorbidities, insurance, disease severity, and prior health care utilization). Zhou et al. demonstrated the viability of NLP techniques for the automated evaluation of a large number of lifestyle habits in patients with Alzheimer’s disease using free-text EHR data. Zhang-James et al. used machine learning algorithms to construct prediction models to identify ADHD youth at risk for SUDs by taking socioeconomic, educational, and geographic data into account.
Diabetes-related SBDH factors included lifestyle, socioeconomic factors, employment status, housing instability, education, and marital status. Zhang et al. discovered new quantitative characteristics of electronic records of lifestyle counseling that are linked to better glucose control in diabetic patients.
Social support, physical activity, diet, lifestyle, and socioeconomic factors were commonly studied with regard to geriatric syndrome. Anzaldi et al. discovered that the most common geriatric syndrome pattern among “frail” patients was a combination of walking difficulty, lack of social support, falls, and weight loss. Kharrazi et al. used unstructured EHR notes, processed with an NLP algorithm, to identify a significantly higher prevalence of geriatric syndromes; 28 percent of patients lacked social support. Kuo et al. used machine learning to extract geriatric syndromes from EHR free text in order to identify vulnerable older adults and possibly address functional and social inequities in the geriatric population.
Most of the HIV-related SBDH factors fall within the social and community context, such as social behavior, social discrimination, incarceration, social support, and racial disparities. According to Feller et al., socioeconomic determinants of health are increasingly acknowledged as predictors of HIV infection and were included in the model through words relating to drug use, housing instability, and psychological comorbidities. Structured EHR data, on the other hand, had high variable importance in the predictive models, and thus unstructured clinical text and structured EHR data are complementary sources of information for automated HIV risk assessment. Wang et al. established the viability of using the EHR to quantify imprisonment exposure, a prevalent social determinant of health, for research purposes, particularly among racial and ethnic minorities and HIV patients of low socioeconomic status.
3.5. NLP Methods for Extracting SBDH from Clinical Texts
SBDH information can be found in various types of EHR data. A comprehensive summary of SBDH categories, their associated data types, and AI methods for the various studies is illustrated in Figure 5. For structured data, database queries and descriptive statistics are a straightforward and frequently used approach. For unstructured data such as progress notes, discharge summaries, primary care notes, admission notes, and other clinical notes, NLP is widely used to extract information. Information extraction (IE) is the task of automatically extracting predefined types of information (here, SBDH) from unstructured or semistructured data, in this case EHR text or clinical notes from patients’ visits to a clinical site. In this review, 42 papers used NLP methods to extract SBDH from clinical notes.
3.5.1. NLP Tools
In total, 42 articles used NLP tools including Apache cTAKES, MetaMap, Moonstone, MTERMS, BioMedICUS, MediClass, and LEO [14–16, 20, 22, 26, 31–35, 38, 40–43, 45, 46, 48, 51, 59, 61, 62, 68, 69, 71, 73, 74, 76–81, 83, 84, 86, 87, 90, 94, 95]. Apache cTAKES was developed on the UIMA platform and the Apache OpenNLP toolkit and is one of the most popular NLP tools for clinical IE from EHR data. cTAKES was used to identify subtypes of patients with opioid misuse and lifestyle modification. Shoenbill et al. combined cTAKES with rules and regular expressions on selected EHRs to make previously unseen data on lifestyle modification documentation visible. They reported that the results of testing the NLP tool refinement process for combined lifestyle modification retrieval were excellent, with 99.27% recall, 94.44% precision, and an F-measure of 96.79%. MetaMap was originally developed to map biomedical literature to UMLS Metathesaurus concepts but was later applied to clinical texts. MetaMap was used for the identification of homeless patients and the extraction of lifestyle information. The Moonstone system is an open-source rule-based clinical NLP system designed to automatically extract information from clinical notes, especially information requiring inferencing from lower-level concepts. The system was designed to extract social risk factors including housing situations, living alone, and social support [35, 74]. MTERMS encodes clinical text using different terminologies and simultaneously establishes dynamic mappings between them. It was originally designed to extract medication information from clinical notes to facilitate real-time medication reconciliation and has since been extended to support a variety of clinical applications, such as risk factor identification from physician notes.
One study used MTERMS to extract social factor information from physician notes and found that, compared with an 18.6% baseline readmission rate, risk-adjusted analysis showed higher readmission risk for patients with housing instability (readmission rate 24.5 percent), depression (20.6 percent), drug abuse (20.2 percent), and poor social support (20.0 percent).
BioMedICUS (https://github.com/nlpie/biomedicus3) is an open-source system, built on the Unstructured Information Management Architecture (UIMA) framework, for large-scale text analysis and processing of biomedical and clinical reports. It was used to process social history documents to identify social history topics. MediClass is a knowledge-based system that can detect clinical events in both structured and unstructured EHR data. Hazlehurst et al. used the MediClass system with NLP components to identify the 5As (a framework for behavior counseling) of weight loss counseling. The automated method successfully recognized many valid cases of Assist that were not identified in the gold standard. In comparison with the gold standard, mean sensitivity and specificity for each of the 5As were at or above 85%, with the exception of sensitivity for Assist, which was measured at 40% and 60%, respectively, for the two health systems. Leo is a framework that was first built to support scalable deployment of NLP pipelines for processing a large amount of Veterans Affairs clinical notes. It facilitates the rapid creation and deployment of Apache UIMA-Asynchronous Scaleout annotators. The Leo system employed rules to identify the plan portion of the medical records and to detect words and phrases that capture instances of behavioral modification counseling. Gundlapalli et al. demonstrated the feasibility of extracting positively asserted concepts related to homelessness from the free text of medical records with a two-step approach. They first developed a lexicon of terms related to homelessness and then used the V3NLP Framework to detect instances of lexical terms and compare them to the human-annotated reference standard. Their approach had a positive predictive value of 77% for extracting relevant concepts. Schillinger et al. developed NLP techniques to build scalable and reliable literacy profiles for identifying patients with limited health literacy from clinical data on diabetic patients. Using linguistic indices, they developed two automated literacy profiles that have sufficient accuracy in classifying levels of either self-reported health literacy (C-statistics: 0.86 and 0.58, respectively) or expert-rated health literacy (C-statistics: 0.71 and 0.87, respectively) and were significantly associated with educational attainment, race/ethnicity, Consumer Assessment of Healthcare Providers and Systems (CAHPS) scores, adherence, glycemia, comorbidities, and emergency department visits.
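The F-measure reported above for the cTAKES-based lifestyle modification retrieval follows directly from the stated precision and recall. The sketch below assumes the balanced F1 form (harmonic mean with beta = 1); the function name `f_measure` is illustrative.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (F1 when beta is 1)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Precision and recall as reported in the text for the refined NLP tool.
f1 = f_measure(precision=0.9444, recall=0.9927)
print(f"F-measure: {100 * f1:.2f}%")
```

With these inputs the result rounds to 96.79%, which agrees with the reported value and suggests that the balanced form was the one used.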
3.5.2. Rule-Based Methods
Rule-based methods were widely used (7 studies) [15, 32, 38, 76, 78, 81, 83] for extracting SBDH information from clinical records. Rules including key terms were usually manually curated by domain experts. Hollister et al. used 860 terms to extract socioeconomic status-related SBDH data from free text to demonstrate the feasibility of retrieving SBDH data and linking them with other health-related data for genetic studies. A pattern-based NLP method was used to identify additional syndromes from clinical notes, including malnutrition and lack of social support. Incorporating unstructured EHR notes, enabled by applying the pattern-based NLP method, identified considerably higher rates of geriatric syndromes: absence of fecal control (2.1%, 2.3 times as much as structured claims and EHR data combined), decubitus ulcer (1.4%, 1.7 times as much), dementia (6.7%, 1.5 times as much), falls (23.6%, 3.2 times as much), malnutrition (2.5%, 18.0 times as much), lack of social support (29.8%, 455.9 times as much), urinary retention (4.2%, 3.9 times as much), vision impairment (6.2%, 7.4 times as much), weight loss (19.2%, 2.9 times as much), and walking difficulty (36.34%, 3.4 times as much). In one study, the researchers developed an IE system based on a novel permutation-based pattern recognition method to extract information from unstructured clinical documents. The overall IE accuracy of the system was 99% for semistructured text and 97% for unstructured text. Furthermore, the automated, unstructured IE reduced the average time spent on manual data entry by 75% without compromising the accuracy of the system. Another study suggested that some categories of SES data were easier to extract from the EHR than others.
SES data extracted from a manual review of 50 randomly selected records were compared to data produced by the algorithm, resulting in positive predictive values of 80.0% (education), 85.4% (occupation), 87.5% (unemployment), 63.6% (retirement), 23.1% (uninsured), 81.8% (Medicaid), and 33.3% (homelessness). Rule-based NLP methods were also used to extract information for representing patients in different groups. In one study, 2202 patients were described as “frail” in clinical notes. These patients were older and had a significantly higher rate of healthcare utilization than the rest of the population. In another study, the researchers used NLP to categorize reasons for social work referral documented in EHR referral orders. The most frequent needs leading to a social work referral were financial (25%), pregnancy (25%), behavioral health (16%), and family/social support (9%) needs. The most frequently co-occurring needs were pregnancy with language limitation, behavioral health with family/social support, and financial with behavioral health.
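To make the rule-based approach concrete, the following minimal sketch matches manually curated key terms against a note and computes a positive predictive value from review counts. Everything in it is an assumption for illustration: the patterns in `RULES`, the factor labels, and the function names are hypothetical and are not taken from any reviewed system, and real rule sets also handle negation, section context, and family history.

```python
import re

# Hypothetical key-term rules; a real lexicon would be curated by domain experts.
RULES = {
    "housing instability": [r"\bhomeless(?:ness)?\b", r"\bevict(?:ed|ion)\b"],
    "lack of social support": [r"\blives alone\b", r"\bno (?:family|social) support\b"],
    "employment status": [r"\bunemployed\b", r"\bretired\b"],
}
COMPILED = {
    factor: [re.compile(p, re.IGNORECASE) for p in patterns]
    for factor, patterns in RULES.items()
}


def extract_sbdh(note):
    """Return {factor: [matched text, ...]} for every rule that fires on the note."""
    found = {}
    for factor, patterns in COMPILED.items():
        hits = [m.group(0) for p in patterns for m in p.finditer(note)]
        if hits:
            found[factor] = hits
    return found


def positive_predictive_value(true_positives, false_positives):
    """PPV = TP / (TP + FP), judged against a manual chart review."""
    return true_positives / (true_positives + false_positives)


result = extract_sbdh("Patient is unemployed and currently homeless; lives alone.")
print(result)
```

A sentence such as "denies being homeless" would still fire the housing rule in this sketch, which is one reason algorithm output is compared against manual review before it is used.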
3.5.3. Term Expansion
A limitation of rule-based methods is that they struggle to capture the full lexical variation in clinical records. Thus, 4 studies investigated unsupervised learning methods to expand terms further or to learn word representations from unannotated clinical texts. Word embeddings are one such approach: a type of word representation in which semantically similar words receive similar representations based on their contexts in a corpus. Shi et al. used word2vec to obtain a vector representation for each word in the EHR data, which was then fed into a deep neural network model. Lexical association is another approach; it measures the strength of association between two terms in a text corpus. Dorr et al. used lexical association approaches to expand psychosocial terms in clinical texts, which yielded a 90-fold increase in identified patients. Bejan et al. implemented both lexical association and word2vec approaches to expand keywords for homelessness. Word2vec was found to perform better (area under the precision-recall curve [AUPRC] = 0.94) than lexical association for extracting homelessness-related words.
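As a rough sketch of embedding-based term expansion, the snippet below ranks vocabulary words by cosine similarity to a seed term and keeps the closest ones. The three-dimensional vectors are made up for illustration; in practice they would come from a model such as word2vec trained on clinical text, and the `top_k` and `min_sim` thresholds are assumptions, not values from the reviewed studies.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def expand_terms(seed, embeddings, top_k=2, min_sim=0.8):
    """Return up to top_k words most similar to the seed, above min_sim."""
    seed_vec = embeddings[seed]
    scored = [(w, cosine(seed_vec, vec)) for w, vec in embeddings.items() if w != seed]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [w for w, sim in scored[:top_k] if sim >= min_sim]

# Toy vectors (hypothetical); real embeddings have hundreds of dimensions.
toy_embeddings = {
    "homeless": [0.9, 0.1, 0.0],
    "shelter": [0.8, 0.2, 0.1],
    "undomiciled": [0.85, 0.15, 0.05],
    "aspirin": [0.0, 0.1, 0.9],
}
candidates = expand_terms("homeless", toy_embeddings)
```

Candidate terms produced this way are usually reviewed manually before being added to the final term list.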
3.5.4. Topic Modeling
Topic modeling is an unsupervised learning method for exploring latent topics in a given corpus. Three studies [40, 46, 95] applied topic modeling to clinical notes. Latent Dirichlet Allocation (LDA) is a robust topic model that learns K topics for a given corpus, where each topic is represented as a distribution over n words. Wang et al. used LDA to explore various themes (e.g., nutrition and social support) mentioned in care provider notes of dementia patients. Among 250 topics generated by the LDA models, they identified 224 stable topics. Because some topics conveyed similar themes, domain experts analyzed all stable topics and classified them into 72 unique categories, such as medication delivery and hospital care. Feller et al. used both LDA and TF-IDF to identify keywords for developing a prediction model for HIV risk assessment. These keywords were related to drug use and housing instability. The patterns and trends of the generated topics provide unique findings and insights that are often not documented in the structured data fields of the EHR.
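Of the two techniques named above, TF-IDF is the simpler to illustrate. The sketch below assumes one common weighting (term frequency as count divided by document length, inverse document frequency as ln(N/df)); the reviewed studies may have used other variants, and the example notes are invented.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed weighting: tf = count / document length, idf = ln(N / df).
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append(
            {t: (c / len(doc)) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return weights

# Hypothetical note snippets, tokenized on whitespace.
docs = [
    "patient reports heroin use and unstable housing".split(),
    "patient reports stable housing and no drug use".split(),
    "patient seen for routine follow up".split(),
]
weights = tf_idf(docs)
# Terms of the first note, highest weight first.
ranked = sorted(weights[0], key=weights[0].get, reverse=True)
```

Words that occur in every note (here "patient") receive a weight of zero, while words specific to one note rank highest, which is why TF-IDF is useful for surfacing candidate keywords.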
3.5.5. Deep Learning
Two studies developed deep learning methods [42, 80]. Chen et al. trained a deep neural network model that uses contextual information to identify sentences in clinical notes indicating the presence of a geriatric syndrome, including lack of social support. Contextual information improved classification, with the most effective context coming from the surrounding sentences. At the sentence level, their best-performing model achieved a micro-F1 of 0.605, significantly outperforming context-free baselines. At the patient level, their best model achieved a micro-F1 of 0.843. Senior et al. developed a neural network model to extract information such as the highest level of formal education from clinical notes as predictors of suicide in severe mental illness. They developed a named entity recognition (NER) model, which recognizes concepts in free text. The model identified eight concepts relevant for suicide risk assessment: medication (antidepressant/antipsychotic treatment), violence, education, self-harm, benefits receipt, drug/alcohol use disorder, suicide, and psychiatric admission. The NER model had an overall precision of 0.77, recall of 0.90, and F1 score of 0.83. The concept with the best precision and recall was medication (precision = 0.84 and recall = 0.96), and the weakest were suicide (precision = 0.37) and drug/alcohol use disorder (recall = 0.61).
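For readers less familiar with these metrics, the sketch below shows how an F1 score follows from precision and recall and how a micro-averaged F1 pools counts across concepts before scoring. The function names and the per-concept counts in the usage lines are illustrative only; the precision and recall values are the overall NER figures reported above.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def micro_f1(counts):
    """Micro-averaged F1 from a list of (tp, fp, fn) tuples, one per concept.

    Counts are pooled over all concepts before precision and recall
    are computed, so frequent concepts dominate the score.
    """
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return f1_score(precision, recall)

# Overall NER precision and recall reported above.
overall_f1 = f1_score(0.77, 0.90)
# Hypothetical per-concept counts, for illustration only.
pooled_f1 = micro_f1([(8, 2, 0), (1, 0, 3)])
```

Rounded to two decimals, `overall_f1` reproduces the reported F1 of 0.83.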
3.5.6. Corpus Development
Developing a corpus is vital for developing reliable NLP methods. Three studies focused on the development of an annotated SBDH corpus. Volij and Esteban developed an annotation standard to detect intrasocial support in electronic medical records. Lybarger et al. recently used an active learning framework to develop the Social History Annotation Corpus (SHAC), which comprises 4480 social history sections annotated for 12 SBDH, characterized by status, extent, and temporal information. The actively selected samples improved performance in both the surrogate task and the target event extraction task. A neural multitask model was presented for characterizing substance use, employment, and living status across multiple dimensions, including status, extent, and temporal fields. The event extractor model achieved high performance on the MIMIC and UW datasets: 0.89-0.98 F1 for identifying distinct SBDH events, 0.82-0.93 F1 for substance use status, 0.81-0.86 F1 for employment status, and 0.81-0.93 F1 for living status type.
3.5.7. Conversational Agent
One study was designed to analyze the significance of employing Alexa-based intelligent agents for patient coaching. The authors reported that intelligent agents are another highly efficient model for intervention. Furthermore, they claimed that this approach has the potential to reshape the way interventions are delivered.
3.6. Predictive Models Using SBDH for Healthcare Outcomes
Of the included studies, 57 used SBDH factors to predict healthcare outcomes. Figure 5 also summarizes the sources of SBDH for predictive modeling among the included studies. Predictive modeling uses mathematical and computational methods to predict an event or outcome at a future time point of interest. In most cases, a model is chosen on the basis of detection theory to estimate the probability of an outcome given a set of input variables. In general, these models use one or more classifiers to determine the probability that a set of data belongs to a given class, depending on the task. Below, we categorize these studies by predictive modeling technique. Note that a single study can report multiple predictive models; such studies were counted in each applicable methodology category.
3.6.1. Supervised Machine Learning Methods
(1) Regression Models. Regression is the most commonly used approach for predicting outcomes. Sixteen studies [21, 23, 25, 30, 36, 43, 45, 50, 52, 53, 56, 63, 65, 75, 91, 99] used various types of regression models, including logistic regression, least absolute shrinkage and selection operator (LASSO) regression, and Cox proportional hazards models. Kim et al. used logistic regression to develop predictors of suicide from various suicidal behaviors and substance-related variables. After adjusting for administratively available data, they found that the major predictors of suicide were prescription drug misuse, with an odds ratio (OR) of 6.8 (95% CI, 2.5-18.5); history of suicide attempts, 6.6 (95% CI, 1.7-26.4); and alcohol abuse/dependence, 3.3 (95% CI, 1.9-5.7). Difficulty with access to health care was also a predictor of suicide (95% CI, 1.3-6.3).
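To make the link between a logistic regression coefficient and a reported odds ratio concrete, the sketch below converts a log-odds coefficient and its standard error into an OR with a 95% confidence interval. It assumes a Wald-type interval (estimate ± 1.96 standard errors on the log scale), which the source does not state; the standard error is back-calculated from the reported interval purely for illustration, and the function names are hypothetical.

```python
import math

def odds_ratio_with_ci(beta, se, z=1.96):
    """Convert a log-odds coefficient and its standard error to (OR, lower, upper).

    Assumes a Wald interval: exp(beta - z * se) to exp(beta + z * se).
    """
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)

def se_from_ci(lower, upper, z=1.96):
    """Recover the standard error of the log odds ratio from a reported CI."""
    return (math.log(upper) - math.log(lower)) / (2 * z)

# Reported above for prescription drug misuse: OR 6.8 (95% CI, 2.5-18.5).
se = se_from_ci(2.5, 18.5)
odds_ratio, ci_lower, ci_upper = odds_ratio_with_ci(math.log(6.8), se)
```

Under this assumption the reconstructed interval agrees with the reported one to within rounding, which suggests the published interval is roughly symmetric around the estimate on the log scale.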
(2) Random Forest and Decision Tree. Twelve studies [24, 28, 49, 54, 58, 60, 64, 70, 84, 92, 93, 95] used tree-based machine learning (ML) algorithms, including random forests and decision trees. Nau et al. used nonparametric machine learning methods such as Conditional Random Forest (CRF) to identify the combination of community features most important for predicting obesogenic and obesoprotective environments for children, whereas Agrawal et al. used random forests together with gradient boosting and stacked generalization to predict their outcomes from structured data. Davoudi et al., Walsh et al., Grinspan et al., Feller et al., and Erickson et al. all used random forest variants. Grinspan et al. showed through bivariate analyses that there are multiple potential predictors of emergency department (ED) use, such as demographics, SBDH, comorbidities, insurance, disease severity, and prior health care utilization. Their study used EHR data to predict ED use at two centers. The random forest model tied with a 3-variable model at one of the two centers, and the 3-variable model outperformed it at the other center.
(3) Neural Networks. Eight studies used neural networks [27, 39, 44, 52, 53, 65, 87, 100]. Shi et al. implemented a bidirectional recurrent neural network (RNN) to predict pediatric diagnoses, whereas Vrbaški et al. developed predictors for lipid profile prediction. Both methods used a combination of structured and unstructured data with various natural language preprocessing steps. Xue et al. used an RNN-based time-aware architecture to predict obesity status. Zhang-James et al. built an ensemble of a cross-sectional random forest (RF) model and a longitudinal RNN model with the Long Short-Term Memory (LSTM) architecture to predict at-risk comorbid substance use disorders (SUDs) in individuals with ADHD. They found that population registry data and linked EHRs can be used reliably for this prediction. One important empirical observation from their study is that risk monitoring over years of child development can be achieved with a longitudinal LSTM model, which was able to predict later SUD risk as early as 2 years of age, 10 years before the earliest diagnosis, with an average AUC of 0.63.
(4) Support Vector Machines (SVM). Relatively few studies (five studies) [26, 65, 72, 80] used SVMs for prediction compared with the methods discussed above. Wang et al. compared back-propagation neural network (BPNN), SVM, and logistic regression (LR) models for predicting nonadherence to azathioprine (AZA) in CD patients and reported that the SVM had the best performance. Davoudi et al. used various ML models, including SVM, to predict the risk of delirium from preoperative EHR data. Other clinical studies also used SVM-based methodologies.
3.6.2. Unsupervised Machine Learning Methods
A few studies used unsupervised ML methods. Afshar et al. used LDA to identify subtypes of patients with opioid misuse, whereas Cui et al. used K-means clustering and principal component analysis (PCA) to discover latent clusters in COVID-19 patients. Kirk et al. described an algorithm that uses unsupervised Markov clustering (MCL) to perform a phenotypic characterization of a Danish diabetes cohort. The stratification of the diabetes cohort is based on characteristics extracted from the unstructured EHR records of the target (homogeneous) population, including several diagnoses and lifestyle factors. Patient clusters are obtained by combining unsupervised MCL with other NLP techniques.
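As a minimal sketch of how K-means groups patients into latent clusters, the snippet below alternates between assigning each patient to the nearest centroid and recomputing each centroid as the mean of its members. The two features per patient are hypothetical normalized scores, and the deterministic initialization (the first k points) is a simplification chosen for reproducibility; it is not drawn from any reviewed study.

```python
def kmeans(points, k, n_iter=20):
    """Plain K-means with squared Euclidean distance.

    Centroids are initialized from the first k points (a simplification;
    random or k-means++ initialization is more common in practice).
    """
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: nearest centroid for each point.
        labels = [
            min(range(k), key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids

# Hypothetical patients described by two normalized SBDH-related scores.
patients = [(0.1, 0.2), (0.9, 0.8), (0.2, 0.1), (0.8, 0.9)]
labels, centroids = kmeans(patients, k=2)
```

With these toy data the first and third patients fall into one cluster and the second and fourth into the other.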
Thus, predictive modeling techniques are used to develop markers for predicting clinical outcomes of interest from a set of characteristics and behaviors. In most of the studies, SBDH factors are either model features or characteristics that define a specific cohort, and they act as markers of a predefined outcome. Some studies also analyze whether adding SBDH to risk prediction models improves prediction accuracy for specific ailments. Another line of research analyzes the individual contributions of SBDH at the patient level, informing appropriate interventions that can reduce the risk of negative health outcomes such as preventable readmissions and/or hospitalizations.
SBDH (or SDOH) research has become an active and interdisciplinary research domain, covering healthcare, informatics, computer science, and social science. It is important to acknowledge that SBDH factors have a major impact on health outcomes. Our review indicates that various SBDH categories have been investigated across a wide range of disease categories. The most studied SBDH factors are substance abuse, employment status, and socioeconomic status, whereas other important SBDH factors, such as social environment, psychosocial factors, and racial disparities, are understudied, potentially because of a lack of data. The most studied disease areas using SBDH are mental and endocrine disorders. Other severe diseases (e.g., cancers) for which SBDH could be important factors in healthcare outcomes are rarely studied using the extensive information in EHRs. The associations between socioeconomic factors and health outcomes are complicated and diverse; several pathways may be involved. We observed that multiple SBDH factors are often investigated for a single disease, and it remains unclear how many of these factors actually affect the course of the disease. The wide range of SBDH factors to be considered when examining a patient could overwhelm the physician and may affect decision-making and policy-making. From our analysis, we found that most of the studies we reviewed focused on mid- and downstream SBDH factors rather than upstream factors such as governance and policy. Research on how SBDH influences established pathways that contribute to health inequalities is much needed.
EHRs contain rich information on patients’ health conditions and treatment processes; however, the representation of SBDH in EHRs is still limited. Certain data are clearly stated in clinical text (for example, documentation of drug and alcohol use); nonetheless, a considerable proportion of information about certain SBDH factors, such as social environment, is not directly indicated in the clinical note but can be inferred. It is estimated that the majority of clinical data in the EHR is unstructured and hence difficult to examine. Data such as physician notes, nurse notes, discharge summaries, and patient-reported information have the potential to contribute a wealth of essential clinical information but are often unusable, which forfeits a significant means of improving population health. Using NLP technology to support the extraction of SBDH and other clinical concepts simplifies the identification and extraction of essential concepts from clinical data, efficiently unlocks unstructured data, and aids in the resolution of unstructured data-related issues. Better-informed population health decisions can be made with the use of accurate and complete SBDH information.
In this review, most studies focus on one or more SBDH factors. The content on the association between SBDH and healthcare-related statuses shows that SBDH is recognized as potentially useful for understanding patients’ health status. Even though several papers focus on developing NLP algorithms to extract SBDH from clinical notes, there is still a data bias regarding the representation and completeness of SBDH in EHRs. Individual-level SBDH information usually contributes to the accuracy of predictive models; however, it is usually hard to capture accurately or completely in EHRs. In this review, we focused only on NLP techniques that were used to extract one or more SBDH aspects from clinical records. Our analysis indicates that IE for SBDH has been dominated by rule-based approaches, including rule-based NLP tools and methods. Half of the studies used existing clinical NLP tools, several of which (e.g., cTAKES and MetaMap) were widely used in other domains as well. Manually curated key terms were also widely used in rule-based methods. These findings are consistent with a recent review of NLP methodology for clinical IE. Because of the variability of clinical concepts and the limitations of handcrafted term lists, unsupervised ML methods (word embeddings) were commonly used to expand term lists by finding semantically similar terms in clinical corpora. Topic modeling was another commonly used approach for clustering key terms into coherent latent topics. These methods often need manual checks to confirm the final term list or the appropriate number of topics. Very few clinical data corpora on SBDH were available, which has limited the number of studies using supervised ML methods. However, a couple of studies investigated how to develop SBDH-specific corpora. One study used active learning methods, which have been shown to reduce human annotation effort in other studies [104–106].
Deep learning models have shown promising results and predominate in the general NLP domain; however, only 2 studies in the analyzed literature used deep learning methods. Developing accurate deep learning models for SBDH requires a large amount of annotated training data, and producing such data is a time-consuming and labor-intensive process. One possible solution is to use advanced IE techniques, such as distant supervision, which automatically or semiautomatically generate weak labels for training deep learning models. Predictive models were widely used to investigate the association between SBDH factors and health outcomes. Such outcomes include specific diseases and administrative aspects, such as readmission. National and regional efforts are contributing to the integration of SBDH into EHRs, including establishing national standards for SBDH data collection and representation [108, 109] and developing SBDH integration tools. Various SBDH integration tools have emerged to collect more SBDH documentation in EHRs. Nevertheless, considerable community effort is still needed to address SBDH data bias in EHR data.
AI has impacted every aspect of our daily life, from product recommendations to intelligent personal assistants, powered by the availability of large volumes of data. The increasing adoption of EHR systems in healthcare organizations has fostered the secondary use of EHR data in AI techniques to improve patient care outcomes through clinical decision support, chronic disease management, patient education, and so forth. However, like human beings, AI algorithms are vulnerable to biases that may result in unfair decisions. For example, the Correctional Offender Management Profiling for Alternative Sanctions (COMPAS) system assigns higher risk scores to African-Americans than to Caucasians when it is used by judges to assess the risk of committing another crime. In healthcare settings, bias in AI algorithms may result in more serious issues, such as affecting patients’ safety, delaying treatment, and even risking lives. Therefore, addressing bias in AI algorithms is crucial for successfully deploying AI applications in healthcare and clinical practice. AI bias can be summarized in two categories: data bias and algorithm bias. While algorithm bias is related to algorithm design and model architecture and is not trivial to mitigate, data bias is relatively easier to address by carefully selecting unbiased cohorts and datasets. SBDH could serve as a metric to measure the bias in EHR data and to select a fair cohort for training AI algorithms. For example, socioeconomic status, an important SBDH variable, could be used to ensure that training data better represent patients of different socioeconomic status.
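One simple way to use an SBDH variable as a check on data bias is to compare the share of each socioeconomic stratum in a training cohort with its share in a reference population. The sketch below is only an illustration of that idea: the strata, the reference shares, the 5-percentage-point tolerance, and the function names are all assumptions.

```python
from collections import Counter

def representation_gap(cohort_labels, reference_shares):
    """For each stratum, return cohort share minus reference population share."""
    if not cohort_labels:
        raise ValueError("empty cohort")
    counts = Counter(cohort_labels)
    total = len(cohort_labels)
    return {
        stratum: counts.get(stratum, 0) / total - share
        for stratum, share in reference_shares.items()
    }

def underrepresented(cohort_labels, reference_shares, tolerance=0.05):
    """List strata whose cohort share falls short of the reference by more than tolerance."""
    gaps = representation_gap(cohort_labels, reference_shares)
    return sorted(s for s, gap in gaps.items() if gap < -tolerance)

# Hypothetical training cohort and reference population shares.
cohort = ["high"] * 60 + ["middle"] * 30 + ["low"] * 10
reference = {"high": 0.3, "middle": 0.4, "low": 0.3}
flagged_strata = underrepresented(cohort, reference)
```

Strata flagged in this way could then be oversampled or reweighted when assembling the training data, although the appropriate remedy depends on the study design.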
Health equity issues caused by complex, integrated, and overlapping social structures and economic systems are gaining recognition among scientists and public health professionals. SBDH is an important indicator of health equity because it captures whether people have access to an adequate diet, medical care, and educational and career opportunities; whether their environmental conditions are healthy; and whether they are exposed to physical or psychological trauma. SBDH helps us develop comprehensive strategies to address potential risks for the population, particularly for underserved populations. It is well documented that social conditions impact premature mortality in underserved communities. SBDH are responsible for many of the leading health disparities in the US. Gaining insight into SBDH and how SBDH information can be extracted from EHRs improves opportunities to increase wellness and prevent premature illness, gives health care teams the insight needed to increase patient action (i.e., adherence, behavior change, and compliance), provides the information needed to influence health policy change for wellness, and eventually promotes health and health equity.
This review has several limitations. First, our search terms and databases might be insufficient to cover all studies. We only included articles written in English. Because the definition of SBDH is very broad and not specifically defined in the literature, the search keywords used in this review might not be sufficient to retrieve all SBDH-related articles. Second, we focused only on studies that developed or adapted AI methods on EHR data for SBDH research. Several studies that used non-EHR data to study SBDH, such as clinical trial data, survey data, or claims data [13, 17], were excluded from this review.
In summary, this scoping review discussed the current trends, challenges, and future directions of using AI methods for SBDH in EHR data. Our analysis indicates that, despite known associations between SBDH and diseases, SBDH factors are not commonly examined as interventions to improve patients’ healthcare outcomes. Gaining insight into SBDH and how SBDH data can be extracted from EHRs using NLP techniques and predictive models improves the opportunities to influence health policy change for patient wellness and eventually to promote health and health equity.
Conflicts of Interest
We declare no competing interests.
Anusha Bompelli and Yanshan Wang contributed equally to this work.
YW was supported by the Mayo Clinic Center for Health Equity and Community Engagement Research Award. RZ was partially supported by the National Institutes of Health’s National Center for Complementary & Integrative Health (NCCIH), the Office of Dietary Supplements (ODS), and the National Institute on Aging (NIA), grant number R01AT009457 (PI: Zhang).
Supplemental Document 1: literature searching strategies on Ovid, Scopus, Web of Science, ACM digital library, and IEEE Xplore. Supplementary Table 1: information extracted from the included studies. (Supplementary Materials)
- “Social Determinants of Health - Healthy People 2030| http://health.gov,” January 2021, https://health.gov/healthypeople/objectives-and-data/social-determinants-health.
- M. R. Sterling, J. B. Ringel, L. C. Pinheiro et al., “Social determinants of health and 90-day mortality after hospitalization for heart failure in the REGARDS study,” Journal of the American Heart Association, vol. 9, no. 9, article e014836, 2020.
- G. K. Singh, G. Daus, M. Allender et al., “Social determinants of health in the United States: addressing major health inequality trends for the nation, 1935-2016,” International Journal of MCH and AIDS (IJMA), vol. 6, no. 2, pp. 139–164, 2017.
- C. Eppes, M. Salahuddin, P. S. Ramsey, C. Davidson, and D. A. Patel, “Social determinants of health and severe maternal morbidity during delivery hospitalizations in Texas [36L],” Obstetrics and Gynecology, vol. 135, p. 133S, 2020.
- M. Bush, “Addressing the Root Cause,” North Carolina Medical Journal, vol. 79, no. 1, pp. 26–29, 2018.
- H. J. Heiman and S. Artiga, Beyond Health Care: The Role of Social Determinants in Promoting Health and Health Equity, 2015.
- S. Galea, M. Tracy, K. J. Hoggatt, C. DiMaggio, and A. Karpati, “Estimated deaths attributable to social factors in the United States,” American Journal of Public Health, vol. 101, no. 8, pp. 1456–1465, 2011.
- J. J. Deferio, S. Breitinger, D. Khullar, A. Sheth, and J. Pathak, “Social determinants of health in mental health care and research: a case for greater inclusion,” Journal of the American Medical Informatics Association, vol. 26, no. 8-9, pp. 895–899, 2019.
- Y. Zhou, F. Wang, J. Tang, R. Nussinov, and F. Cheng, “Artificial intelligence in COVID-19 drug repurposing,” The Lancet Digital Health, vol. 2, no. 12, pp. e667–e676, 2020.
- “Completeness of social and behavioral determinants of health in electronic health records: a case study on the patient-provided information from a minority cohort with sexually transmitted diseases,” 2020.
- M. Chen, X. Tan, and R. Padman, “Social determinants of health in electronic health records and their impact on analysis and risk prediction: a systematic review,” Journal of the American Medical Informatics Association, vol. 27, no. 11, pp. 1764–1773, 2020.
- E. Hatef, M. Rouhizadeh, I. Tia et al., “Assessing the availability of data on social and behavioral determinants in structured and unstructured electronic health records: a retrospective analysis of a multilevel health care system,” JMIR Medical Informatics, vol. 7, no. 3, article e13802, 2019.
- J. R. Curtis, H. Yun, C. J. Etzel, S. Yang, and L. Chen, “Use of machine learning and traditional statistical methods to classify RA-related disability using administrative claims data,” Arthritis Rheumatol. Conf. Am. Coll. Rheumatol. Rheumatol. Health Prof. Annu. Sci. Meet. ACRARHP, vol. 69, Supplement 10, 2017.
- A. S. Navathe, F. Zhong, V. J. Lei et al., “Hospital readmission and social risk factors identified from physician notes,” Health Services Research, vol. 53, no. 2, pp. 1110–1136, 2018.
- H. Kharrazi, L. J. Anzaldi, L. Hernandez et al., “The value of unstructured electronic health record data in geriatric syndrome case identification,” Journal of the American Geriatrics Society, vol. 66, no. 8, pp. 1499–1507, 2018.
- J. Erickson, K. Abbott, and L. Susienka, “Automatic address validation and health record review to identify homeless social security disability applicants,” Journal of Biomedical Informatics, vol. 82, pp. 41–46, 2018.
- R. J. Desai, S. Wang, M. Vaduganathan, and S. Schneeweiss, “Abstracts,” Pharmacoepidemiology and Drug Safety, vol. 28, Supplement 2, no. S2, pp. 5–586, 2019.
- A. Bompelli, G. Silverman, R. Finzel et al., Comparing NLP systems to extract entities of eligibility criteria in dietary supplements clinical trials using NLP-ADAPT, vol. 12299, LNAI. Springer Science and Business Media Deutschland GmbH, 2020.
- World Health Organization, “A conceptual framework for action on the social determinants of health: debates, policy & practice, case studies,” 2010, January 2021, http://apps.who.int/iris/bitstream/10665/44489/1/9789241500852_eng.pdf.
- K. Lybarger, M. Ostendorf, and M. Yetisgen, “Annotating social determinants of health using active learning, and characterizing determinants using neural event extraction,” Journal of Biomedical Informatics, vol. 113, p. 103631, 2021.
- S. A. Berkowitz, S. Basu, A. Venkataramani, G. Reznor, E. W. Fleegler, and S. J. Atlas, “Association between access to social service resources and cardiometabolic risk factors: a machine learning and multilevel modeling analysis,” BMJ Open, vol. 9, no. 3, article e025281, 2019.
- X. Zhou, Y. Wang, S. Sohn, T. M. Therneau, H. Liu, and D. S. Knopman, “Automatic extraction and assessment of lifestyle exposures for Alzheimer's disease using natural language processing,” International Journal of Medical Informatics, vol. 130, p. 103943, 2019.
- L. Tong, C. Erdmann, M. Daldalian, J. Li, and T. Esposito, “Comparison of predictive modeling approaches for 30-day all-cause non-elective readmission risk,” BMC Medical Research Methodology, vol. 16, no. 1, p. 26, 2016.
- A. Davoudi, T. Ozrazgat-Baslanti, A. Ebadi, A. C. Bursian, A. Bihorac, and P. Rashidi, “Delirium prediction using machine learning models on predictive electronic health records data,” in 2017 IEEE 17th International Conference on Bioinformatics and Bioengineering (BIBE), pp. 568–573, Washington, DC, USA, October 2017.
- K. M. Corey, S. Kashyap, E. Lorenzi et al., “Development and validation of machine learning models to identify high-risk surgical patients using automatically curated electronic health record data (Pythia): a retrospective, single-site study,” PLoS Medicine, vol. 15, no. 11, article e1002701, 2018.
- G. S. Kerr, J. S. Richards, C. A. Nunziato et al., “Measuring physician adherence with gout quality indicators: a role for natural language processing,” Arthritis Care and Research, vol. 67, no. 2, pp. 273–279, 2015.
- M. Jamei, A. Nisnevich, E. Wetchler, S. Sudat, and E. Liu, “Predicting all-cause risk of 30-day hospital readmission using artificial neural networks,” PLoS One, vol. 12, no. 7, article e0181173, 2017.
- D. Agrawal, C. B. Chen, R. W. Dravenstott et al., “Predicting patients at risk for 3-day postdischarge readmissions, ED visits, and deaths,” Medical Care, vol. 54, no. 11, pp. 1017–1023, 2016.
- F. Rahimian, G. Salimi-Khorshidi, A. H. Payberah et al., “Predicting the risk of emergency admission with machine learning: development and validation using linked electronic health records,” PLoS Medicine, vol. 15, no. 11, article e1002695, 2018.
- H. M. Kim, E. G. Smith, D. Ganoczy et al., “Predictors of suicide in patient charts among patients with depression in the Veterans Health Administration health system: importance of prescription drug and alcohol abuse,” The Journal of Clinical Psychiatry, vol. 73, no. 10, pp. e1269–e1275, 2012.
- E. A. Lindemann, E. S. Chen, Y. Wang, S. J. Skube, and G. B. Melton, “Representation of social history factors across age groups: a topic analysis of free-text social documentation,” AMIA Annual Symposium Proceedings, vol. 2017, pp. 1169–1178, 2017.
- M. Afzal, M. Hussain, W. A. Khan, T. Ali, A. Jamshed, and S. Lee, “Smart extraction and analysis system for clinical research,” Telemedicine Journal and E-Health, vol. 23, no. 5, pp. 404–420, 2017.
- D. J. Feller, J. Zucker, O. Bear Don't Walk IV et al., “Towards the inference of social and behavioral determinants of sexual health: development of a gold-standard corpus with semi-supervised learning,” AMIA Annual Symposium Proceedings, vol. 2018, pp. 422–429, 2018.
- D. A. Thompson, D. M. Courtney, S. Malik, M. Schmidt, and V. Weston, “Use of natural language processing to identify 414 different chief complaints in adult emergency department patients,” Academic Emergency Medicine, vol. 25, Supplement 1, p. S193, 2018.
- D. J. Feller, J. Zucker, M. T. Yin, P. Gordon, and N. Elhadad, “Using natural language processing to extract social determinants of health and improve 30-day readmission models,” Journal of General Internal Medicine, vol. 32, 2 Supplement 1, p. S370, 2017.
- A. H. S. Harris, A. C. Kuo, T. R. Bowe, L. Manfredi, N. F. Lalani, and N. J. Giori, “Can machine learning methods produce accurate and easy-to-use preoperative prediction models of one-year improvements in pain and functioning after knee arthroplasty?” The Journal of Arthroplasty, vol. 36, no. 1, pp. 112–117.e6, 2021.
- D. A. DuBay, Z. Su, T. A. Morinelli et al., “Development and future deployment of a 5 years allograft survival model for kidney transplantation,” Nephrology, vol. 24, no. 8, pp. 855–862, 2019.
- B. M. Hollister, N. A. Restrepo, E. Farber-Eger, D. C. Crawford, M. C. Aldrich, and A. Non, “Development and performance of text-mining algorithms to extract socioeconomic status from de-identified electronic health records,” Pacific Symposium on Biocomputing, vol. 22, pp. 230–241, 2017.
- L. Zheng, O. Wang, S. Hao et al., “Development of an early-warning system for high-risk patients for suicide attempt using deep learning and electronic health records,” Translational Psychiatry, vol. 10, no. 1, p. 72, 2020.
- L. Wang, J. Lakin, C. Riley, Z. Korach, L. N. Frain, and L. Zhou, “Disease trajectories and end-of-life care for dementias: latent topic modeling and trend analysis using clinical notes,” AMIA Annual Symposium Proceedings, vol. 2018, pp. 1056–1065, 2018.
- M. Richard, X. Aimé, M. O. Krebs, and J. Charlet, “Enrich classifications in psychiatry with textual data: an ontology for psychiatry including social concepts,” Studies in Health Technology and Informatics, vol. 210, pp. 221–223, 2015.
- M. Senior, M. Burghart, R. Yu et al., “Identifying predictors of suicide in severe mental illness: a feasibility study of a clinical prediction rule (Oxford Mental Illness and Suicide Tool or OxMIS),” Frontiers in Psychiatry, vol. 11, p. 268, 2020.
- A. Hassoon, J. Schrack, D. Naiman et al., “Increasing physical activity amongst overweight and obese cancer survivors using an Alexa-based intelligent agent for patient coaching: protocol for the Physical Activity by Technology Help (PATH) trial,” JMIR Research Protocols, vol. 7, no. 2, p. e27, 2018.
- Y. Zhang-James, Q. Chen, R. Kuja-Halkola, P. Lichtenstein, H. Larsson, and S. V. Faraone, “Machine-learning prediction of comorbid substance use disorders in ADHD youth using Swedish registry data,” Journal of Child Psychology and Psychiatry, vol. 61, no. 12, pp. 1370–1379, 2020.
- Q. Xue, X. Wang, S. Meehan, J. Kuang, J. A. Gao, and M. C. Chuah, “Recurrent neural networks based obesity status prediction using activity data,” in 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA), pp. 865–870, Orlando, FL, USA, December 2018.
- M. Afshar, C. Joyce, D. Dligach et al., “Subtypes in patients with opioid misuse: a prognostic enrichment strategy using electronic health record data in hospitalized patients,” PLoS One, vol. 14, no. 7, article e0219717, 2019.
- J. P. Lalor, B. Woolf, and H. Yu, “Improving electronic health record note comprehension with NoteAid: randomized trial of electronic health record note comprehension interventions with crowdsourced workers,” Journal of Medical Internet Research, vol. 21, no. 1, article e10793, 2019.
- C. Dillahunt-Aspillaga, D. Finch, J. Massengale, T. Kretzmer, S. L. Luther, and J. A. McCart, “Using information from the electronic health record to improve measurement of unemployment in service members and veterans with mTBI and post-deployment stress,” PLoS One, vol. 9, no. 12, article e115873, 2014.
- S. J. Patel, D. Chamberlain, and J. M. Chamberlain, “A machine-learning approach to predicting need for hospitalization for pediatric asthma exacerbation at the time of emergency department triage,” Pediatr. Conf. Natl. Conf. Educ., vol. 142, no. 1, p. e115873, 2017.
- A. Seveso, V. Bozzetti, P. Tagliabue, M. L. Ventura, and F. Cabitza, “Developing a machine learning model for predicting postnatal growth in very low birth weight infants,” in Proceedings of the 13th International Joint Conference on Biomedical Engineering Systems and Technologies, pp. 490–497, Valletta, Malta, 2020.
- E. A. Wang, J. B. Long, K. A. McGinnis et al., “Measuring exposure to incarceration using the electronic health record,” Medical Care, vol. 57, Supplement 2, pp. S157–S163, 2019.
- A. Shaham, G. Chodick, V. Shalev, and D. Yamin, “Personal and social patterns predict influenza vaccination decision,” BMC Public Health, vol. 20, no. 1, p. 222, 2020.
- Q. Chen, Y. Zhang-James, E. J. Barnett et al., “Predicting suicide attempt or suicide death following a visit to psychiatric specialty care: a machine learning study using Swedish national registry data,” PLoS Medicine, vol. 17, no. 11, article e1003416, 2020.
- J. R. Vest and O. Ben-Assuli, “Prediction of emergency department revisits using area-level social determinants of health measures and health information exchange information,” International Journal of Medical Informatics, vol. 129, pp. 205–210, 2019.
- S. Biro, T. Williamson, J. A. Leggett et al., “Utility of linking primary care electronic medical records with Canadian census data to study the determinants of chronic disease: an example based on socioeconomic status and obesity,” BMC Medical Informatics and Decision Making, vol. 16, no. 1, p. 32, 2016.
- N. A. Bhavsar, A. Gao, M. Phelan, N. J. Pagidipati, and B. A. Goldstein, “Value of neighborhood socioeconomic status in predicting risk of outcomes in studies that use electronic health record data,” JAMA Network Open, vol. 1, no. 5, article e182716, 2018.
- A. Dagliati, L. Sacchi, V. Tibollo et al., “A dashboard-based system for supporting diabetes care,” Journal of the American Medical Informatics Association, vol. 25, no. 5, pp. 538–547, 2018.
- K.-M. Kuo, P. C. Talley, M. Kuzuya, and C. H. Huang, “Development of a clinical support system for identifying social frailty,” International Journal of Medical Informatics, vol. 132, p. 103979, 2019.
- H. Zhang, N. Hosomura, M. Shubina, D. C. Simonson, M. A. Testa, and A. Turchin, “Electronic documentation of lifestyle counseling in primary care is associated with lower risk of cardiovascular events in patients with diabetes,” Diabetes, vol. 65, Supplement 1, p. A363, 2016.
- K. Shoenbill, Y. Song, M. Craven, H. Johnson, M. Smith, and E. A. Mendonca, “Identifying patterns and predictors of lifestyle modification in electronic health record documentation using statistical and machine learning methods,” Preventive Medicine, vol. 136, p. 106061, 2020.
- I. K. Kirk, C. Simon, K. Banasik et al., “Linking glycemic dysregulation in diabetes to symptoms, comorbidities, and genetics through EHR data mining,” eLife, vol. 8, no. 12, p. 10, 2019.
- K. Shoenbill, Y. Song, L. Gress, H. Johnson, M. Smith, and E. A. Mendonca, “Natural language processing of lifestyle modification documentation,” Health Informatics Journal, vol. 26, no. 1, pp. 388–405, 2020.
- A. Ferri, R. Rosati, M. Bernardini et al., “Towards the design of a machine learning-based consumer healthcare platform powered by electronic health records and measurement of lifestyle through smartphone data,” in 2019 IEEE 23rd International Symposium on Consumer Technologies (ISCT), pp. 37–40, Ancona, Italy, June 2019.
- C. G. Walsh, J. D. Ribeiro, and J. C. Franklin, “Predicting suicide attempts in adolescents with longitudinal clinical data and machine learning,” Journal of Child Psychology and Psychiatry, vol. 59, no. 12, pp. 1261–1270, 2018.
- L. Wang, R. Fan, C. Zhang et al., “Applying machine learning models to predict medication nonadherence in Crohn’s disease maintenance therapy,” Patient Preference and Adherence, vol. 14, pp. 917–926, 2020.
- J. P. Anderson, J. R. Parikh, D. K. Shenfeld et al., “Reverse engineering and evaluation of prediction models for progression to type 2 diabetes: an application of machine learning using electronic health records,” Journal of Diabetes Science and Technology, vol. 10, no. 1, pp. 6–18, 2016.
- W. Cui, D. Robins, and J. Finkelstein, “Unsupervised machine learning for the discovery of latent clusters in COVID-19 patients using electronic health records,” Studies in Health Technology and Informatics, vol. 272, pp. 1–4, 2020.
- M. Poulymenopoulou, D. Papakonstantinou, F. Malamateniou, and G. Vassilacopoulos, “A health analytics semantic ETL service for obesity surveillance,” Studies in Health Technology and Informatics, vol. 210, pp. 840–844, 2015.
- B. L. Hazlehurst, J. M. Lawrence, W. T. Donahoo et al., “Automating assessment of lifestyle counseling in electronic health records,” American Journal of Preventive Medicine, vol. 46, no. 5, pp. 457–464, 2014.
- C. Nau, H. Ellis, H. Huang et al., “Exploring the forest instead of the trees: an innovative method for defining obesogenic and obesoprotective environments,” Health & Place, vol. 35, pp. 136–146, 2015.
- L. Williamson, C. Wojcik, M. Taunton et al., “Finding undiagnosed patients with familial hypercholesterolemia in primary care using electronic health records,” Journal of the American College of Cardiology, vol. 75, no. 11, p. 3502, 2020.
- A. Tragomalou, G. Moschonis, Y. Manios et al., “Novel e-health applications for the management of cardiometabolic risk factors in children and adolescents in Greece,” Nutrients, vol. 12, no. 5, p. 1380, 2020.
- C. A. Bejan, J. Angiolillo, D. Conway et al., “Mining 100 million notes to find homelessness and adverse childhood experiences: 2 case studies of rare and severe social determinants of health in electronic health records,” Journal of the American Medical Informatics Association, vol. 25, no. 1, pp. 61–71, 2018.
- M. Conway, S. Keyhani, L. Christensen et al., “Moonstone: a novel natural language processing system for inferring social risk from clinical narratives,” Journal of Biomedical Semantics, vol. 10, no. 1, p. 6, 2019.
- T. Byrne, A. E. Montgomery, and J. D. Fargo, “Predictive modeling of housing instability and homelessness in the Veterans Health Administration,” Health Services Research, vol. 54, no. 1, pp. 75–85, 2019.
- C. Dalton-Locke, J. H. Thygesen, N. Werbeloff, D. Osborn, and H. Killaspy, “Using de-identified electronic health records to research mental health supported housing services: a feasibility study,” PLoS One, vol. 15, no. 8, article e0237664, 2020.
- V. J. Zhu, L. A. Lenert, B. E. Bunnell, J. S. Obeid, M. Jefferson, and C. H. Halbert, “Automatically identifying social isolation from clinical narratives for patients with prostate cancer,” BMC Medical Informatics and Decision Making, vol. 19, no. 1, p. 43, 2019.
- L. J. Anzaldi, A. Davison, C. M. Boyd, B. Leff, and H. Kharrazi, “Comparing clinician descriptions of frailty and geriatric syndromes using electronic health records: a retrospective cohort study,” BMC Geriatrics, vol. 17, no. 1, p. 248, 2017.
- C. Volij and S. Esteban, “Development of a systematic text annotation standard to extract social support information form electronic medical records,” Studies in Health Technology and Informatics, vol. 270, pp. 1261–1262, 2020.
- T. Chen, M. Dredze, J. P. Weiner, and H. Kharrazi, “Identifying vulnerable older adult populations by contextualizing geriatric syndrome information in clinical notes of electronic health records,” Journal of the American Medical Informatics Association, vol. 26, no. 8–9, pp. 787–795, 2019.
- A. T. Bako, H. Walter-McCabe, S. N. Kasthurirathne, P. K. Halverson, and J. R. Vest, “Reasons for social work referrals in an urban safety-net population: a natural language processing and market basket analysis approach,” Journal of Social Service Research, vol. 47, no. 3, pp. 414–425, 2021.
- B. Olatosi, J. Zhang, S. Weissman, J. Hu, M. R. Haider, and X. Li, “Using big data analytics to improve HIV medical care utilisation in South Carolina: a study protocol,” BMJ Open, vol. 9, no. 7, article e027688, 2019.
- J. L. Greenwald, P. R. Cronin, V. Carballo, G. Danaei, and G. Choy, “A novel model for predicting rehospitalization risk incorporating physical function, cognitive status, and psychosocial support using natural language processing,” Medical Care, vol. 55, no. 3, pp. 261–266, 2017.
- B. T. Bucher, J. Shi, R. J. Pettit, J. Ferraro, W. W. Chapman, and A. Gundlapalli, “Determination of marital status of patients from structured and unstructured electronic healthcare data,” AMIA Annual Symposium Proceedings, vol. 2019, pp. 267–274, 2019.
- F. Ge, J. Jiang, Y. Wang, C. Yuan, and W. Zhang, “Identifying suicidal ideation among Chinese patients with major depressive disorder: evidence from a real-world hospital-based study in China,” Neuropsychiatric Disease and Treatment, vol. 16, pp. 665–672, 2020.
- D. Dorr, C. A. Bejan, C. Pizzimenti, S. Singh, M. Storer, and A. Quinones, “Identifying patients with significant problems related to social determinants of health with natural language processing,” Studies in Health Technology and Informatics, vol. 264, pp. 1456–1457, 2019.
- J. Shi, X. Fan, J. Wu, J. Chen, and W. Chen, “DeepDiagnosis: DNN-based diagnosis prediction from pediatric big healthcare data,” in Proceedings of the 2018 6th International Conference on Advanced Cloud and Big Data (CBD 2018), pp. 287–292, Lanzhou, China, 2018.
- Z. M. Grinspan, A. D. Patel, B. Hafeez, E. L. Abramson, and L. M. Kern, “Predicting frequent emergency department use among children with epilepsy: a retrospective cohort study using electronic health data from 2 centers,” Epilepsia, vol. 59, no. 1, pp. 155–169, 2018.
- A. V. Gundlapalli, M. E. Carter, M. Palmer et al., “Using natural language processing on the free text of clinical documents to screen for evidence of homelessness among US veterans,” AMIA Annual Symposium Proceedings, vol. 2013, pp. 537–546, 2013.
- A. V. Gundlapalli, M. E. Carter, G. Divita et al., “Extracting concepts related to homelessness from the free text of VA electronic medical records,” AMIA Annual Symposium Proceedings, vol. 2014, pp. 589–598, 2014.
- R. Suchting, C. E. Green, S. M. Glazier, and S. D. Lane, “A data science approach to predicting patient aggressive events in a psychiatric hospital,” Psychiatry Research, vol. 268, pp. 217–222, 2018.
- E. Brignone, J. D. Fargo, R. K. Blais, and A. V. Gundlapalli, “Applying machine learning to linked administrative and clinical data to enhance the detection of homelessness among vulnerable veterans,” AMIA Annual Symposium Proceedings, vol. 2018, pp. 305–312, 2018.
- D. J. Feller, O. J. Bear Don’t Walk IV, J. Zucker, M. T. Yin, P. Gordon, and N. Elhadad, “Detecting social and behavioral determinants of health with structured and free-text clinical data,” Applied Clinical Informatics, vol. 11, no. 1, pp. 172–181, 2020.
- D. Schillinger, R. Balyan, S. A. Crossley, D. S. McNamara, J. Y. Liu, and A. J. Karter, “Employing computational linguistics techniques to identify limited patient health literacy: findings from the ECLIPPSE study,” Health Services Research, vol. 23, p. 23, 2020.
- D. J. Feller, J. Zucker, M. T. Yin, P. Gordon, and N. Elhadad, “Using clinical notes and natural language processing for automated HIV risk assessment,” JAIDS Journal of Acquired Immune Deficiency Syndromes, vol. 77, no. 2, pp. 160–166, 2018.
- Y. Wang, L. Wang, M. Rastegar-Mojarad et al., “Clinical information extraction applications: a literature review,” Journal of Biomedical Informatics, vol. 77, pp. 34–49, 2018.
- L. Zhou, J. M. Plasek, L. M. Mahoney et al., “Using Medical Text Extraction, Reasoning and Mapping System (MTERMS) to process medication information in outpatient clinical notes,” AMIA Annual Symposium Proceedings, vol. 2011, pp. 1639–1648, 2011.
- B. Hazlehurst, H. R. Frost, D. F. Sittig, and V. J. Stevens, “MediClass: a system for detecting and classifying encounter-based clinical events in any electronic medical record,” Journal of the American Medical Informatics Association, vol. 12, no. 5, pp. 517–529, 2005.
- D. Yamin, A. Shaham, G. Chodick, and V. Shalev, “Personal and social patterns predict influenza vaccination decision,” Israel Journal of Health Policy Research, vol. 8, Supplement 1, 2019.
- M. Vrbaški, R. Doroslovački, A. Kupusinac, E. Stokić, and D. Ivetić, “Lipid profile prediction based on artificial neural networks,” Journal of Ambient Intelligence and Humanized Computing, 2019.
- P. Braveman and L. Gottlieb, “The social determinants of health: it’s time to consider the causes of the causes,” Public Health Reports, vol. 129, Supplement 2, pp. 19–31, 2014.
- M. M. Islam, “Social determinants of health and related inequalities: confusion and implications,” Frontiers in Public Health, vol. 7, p. 11, 2019.
- R. C. Palmer, D. Ismond, E. J. Rodriquez, and J. S. Kaufman, “Social determinants of health: future directions for health disparities research,” American Journal of Public Health, vol. 109, Supplement 1, pp. S70–S71, 2019.
- J. Vasilakes, R. Rizvi, G. B. Melton, S. Pakhomov, and R. Zhang, “Evaluating active learning methods for annotating semantic predications,” JAMIA Open, vol. 1, no. 2, pp. 275–282, 2018.
- Y. Chen, T. A. Lasko, Q. Mei et al., “An active learning-enabled annotation system for clinical named entity recognition,” BMC Medical Informatics and Decision Making, vol. 17, Supplement 2, p. 82, 2017.
- Q. Wei, Y. Chen, M. Salimi et al., “Cost-aware active learning for named entity recognition in clinical text,” Journal of the American Medical Informatics Association, vol. 26, no. 11, pp. 1314–1322, 2019.
- M. Mintz, S. Bills, R. Snow, and D. Jurafsky, “Distant supervision for relation extraction without labeled data,” in Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pp. 1003–1011, Suntec, Singapore, August 2009, https://www.aclweb.org/anthology/P09-1113.
- M. N. Cantor and L. Thorpe, “Integrating data on social determinants of health into electronic health records,” Health Affairs, vol. 37, no. 4, pp. 585–590, 2018.
- R. Gold, E. Cottrell, A. Bunce et al., “Developing electronic health record (EHR) strategies related to health center patients’ social determinants of health,” Journal of the American Board of Family Practice, vol. 30, no. 4, pp. 428–447, 2017.
- J. Angwin, J. Larson, S. Mattu, and L. Kirchner, “Machine Bias,” Tech. Rep., ProPublica, January 2021, https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing?token=XgFT9qOh9SrXdoACbzMzWe_PuorElToO.
- R. B. Parikh, S. Teeple, and A. S. Navathe, “Addressing bias in artificial intelligence in health care,” JAMA, vol. 322, no. 24, pp. 2377–2378, 2019.
- T. T. Sharpe, C. Voûte, M. A. Rose, J. Cleveland, H. D. Dean, and K. Fenton, “Social determinants of HIV/AIDS and sexually transmitted diseases among black women: implications for health equity,” Journal of Women's Health, vol. 21, no. 3, pp. 249–254, 2012.
- A. Dinh, S. Miertschin, A. Young, and S. D. Mohanty, “A data-driven approach to predicting diabetes and cardiovascular disease with machine learning,” BMC Medical Informatics and Decision Making, vol. 19, no. 1, p. 211, 2019.
Copyright © 2021 Anusha Bompelli et al. Exclusive Licensee Peking University Health Science Center. Distributed under a Creative Commons Attribution License (CC BY 4.0).