Healthcare Market Review 2009-2010
Living In A Free Text World Article in The Laing and Buisson Healthcare Market Review 2009
Living In a Free Text World
Paul Louth – Head Systems Architect, Medical Management Systems Ltd.
This year the online practice management system Med+DBase has implemented a ground-breaking new clinical coding system, harnessing the intelligence of SNOMED CT. Here we examine the technical background to the coding system and how Med+DBase extracts clinical meaning from unstructured documents.
With the advent of the SNOMED CT clinical coding system, computer systems now have a greater range of concepts that can be coded. Where previously a coding system might have focused on diagnoses, or drugs and devices, or even what an insurer will pay, SNOMED CT tries to capture everything that may ever be represented on a medical record. This represents a sea-change in the capabilities of clinical coding.
However, the enormous scope of SNOMED CT means that we now have a clinical coding system which contains around 600,000 concepts. Gone are the days of remembering the codes of the few common procedures or drugs.
SNOMED CT covers disorders, diseases, drugs, events, findings, foodstuffs, lifestyles, occupations, scientific concepts, units of measurement and much, much more. It also, importantly, has a method for representing relationships between concepts.
Relationships
A relationship may be the simple 'is a' relationship, for example:
Bacterial pneumonia is a Infective pneumonia
All three parts of the above statement have their own codes:
Bacterial pneumonia (disorder) = 53084003
is a (attribute) = 116680003
Infective pneumonia (disorder) = 312342009
The 'is a' relationship helps build a huge hierarchical structure which allows the computer systems to infer more generic characteristics. The value of this comes when doing generic searches for features of a medical record. For example, if a patient has the Bacterial pneumonia code attached to their medical record but you wanted to find every patient with a respiratory disease of any kind, then the 'is a' hierarchy allows that inference:
Bacterial pneumonia is a Infective pneumonia
Infective pneumonia is a Pneumonia
Pneumonia is a Disease of lung
Disease of lung is a Disease of respiratory system
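The inference described above can be sketched in a few lines of Python. This is only an illustration: the small IS_A table, the function names and the two patient records are assumptions made for the example, and a real system would load the relationships from the SNOMED CT release files rather than from a hand-written dictionary.

```python
# Toy 'is a' links copied from the chain above (child -> list of parents).
# Concept names are used instead of codes to keep the example readable.
IS_A = {
    "Bacterial pneumonia": ["Infective pneumonia"],
    "Infective pneumonia": ["Pneumonia"],
    "Pneumonia": ["Disease of lung"],
    "Disease of lung": ["Disease of respiratory system"],
}


def ancestors(concept, is_a=IS_A):
    """Return every concept reachable by following 'is a' links upwards."""
    seen = set()
    stack = [concept]
    while stack:
        current = stack.pop()
        for parent in is_a.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def is_kind_of(concept, target, is_a=IS_A):
    """True if concept is the target itself or one of its descendants."""
    return concept == target or target in ancestors(concept, is_a)


# Hypothetical patient records, each holding a list of coded concepts.
records = {
    "patient-1": ["Bacterial pneumonia"],
    "patient-2": ["Fracture of bone"],
}

respiratory = [
    patient
    for patient, concepts in records.items()
    if any(is_kind_of(c, "Disease of respiratory system") for c in concepts)
]
print(respiratory)  # ['patient-1']
```

The record only ever stored 'Bacterial pneumonia'; the broader match comes entirely from walking the hierarchy.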
The relationship model doesn't stop there; it also covers severities, episodicity, clinical course, associated morphology, finding sites, etc. If we go back to our bacterial pneumonia example, it has a severity relationship which links bacterial pneumonia to the 'severities (qualifier value)' concept. What that means is that the definition of bacterial pneumonia can be further refined by combining it with a severity. For example 'severe bacterial pneumonia':
Bacterial pneumonia (disorder) = 53084003
Severities (qualifier value) = 272141005
Severe (severity modifier) (qualifier value) = 24484000
By combining codes in the computer system, the medical record can start to represent clinical statements rather than clinical terms. As another example, 'Fracture of bone (disorder) = 125605004' has not only a severity but also a finding site, which allows for a more refined meaning.
A medical record which consists of clinical statements rather than clinical terms is much more complete, and also much more useful in terms of interrogation. If you want to find all patients who have been prescribed a particular type of drug and have, for example, severe asthma, rather than just asthma, then clinical statements will allow that; individual clinical terms won't.
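One way to picture the difference between a clinical term and a clinical statement is as a small data structure: a focus concept plus zero or more refinements. The sketch below reuses the three codes from the bacterial pneumonia example; the ClinicalStatement class, the patients_with function and the patient records are hypothetical names invented for illustration, and treating 'Severities' as the linking attribute is a simplification that follows the example above rather than the full SNOMED CT model.

```python
from dataclasses import dataclass

# Codes taken from the example above.
BACTERIAL_PNEUMONIA = 53084003
SEVERITIES = 272141005
SEVERE = 24484000


@dataclass(frozen=True)
class ClinicalStatement:
    """A focus concept refined by (attribute code, value code) pairs."""
    focus: int
    refinements: tuple = ()


def patients_with(records, focus, attribute=None, value=None):
    """Find patients with a statement about focus, optionally refined."""
    found = []
    for patient, statements in records.items():
        for s in statements:
            if s.focus != focus:
                continue
            if attribute is None or (attribute, value) in s.refinements:
                found.append(patient)
                break
    return found


# Hypothetical records: one refined statement, one bare term.
records = {
    "patient-1": [ClinicalStatement(BACTERIAL_PNEUMONIA, ((SEVERITIES, SEVERE),))],
    "patient-2": [ClinicalStatement(BACTERIAL_PNEUMONIA)],
}

print(patients_with(records, BACTERIAL_PNEUMONIA))
# ['patient-1', 'patient-2']
print(patients_with(records, BACTERIAL_PNEUMONIA, SEVERITIES, SEVERE))
# ['patient-1']
```

With bare terms only the first query is possible; the second needs the severity to be attached to the disorder it qualifies.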
Automated Clinical Coding
Clearly however, it is not reasonable for a clinician to find these codes themselves and be expected to compose them into clinical statements. Searching a vast database for clinical terms and their related terms would be incredibly slow and is a very poor use of a clinician's time. It is of course possible to have a person dedicated to coding the medical record, which already happens today, but not all practices can afford that luxury.
What is needed is intelligence in the computer system, so it can take a chunk of free-text (multiple sentences, paragraphs, etc.) and extract the clinical terms, and then post-coordinate them into clinical statements. It would need to understand that humans can misspell terms, as well as use acronyms.
Even more importantly it needs to understand what a sentence is. Whilst this is an easy task for a human, it's not so easy for a computer. If you asked the average person what the definition of a sentence is, they'd probably say something along the lines of "one or more words, first word starts with a capital letter, and the last word is followed by a full-stop".
Here are some example sentences which make that definition too simplistic:
The bread costs £1.20 and the milk costs £1.50.
Harold likes fast cars, e.g. Porsches and Ferraris.
I'm going to the place where I work best, i.e. the coffee shop.
I'm really excited about this sentence!
Those four sentences show that a '.' doesn't always mean the end of a sentence. Although we know the difference between a full-stop, a decimal point, or acronym dots, computers don't. The very same key on the keyboard is used for all of them. Also there's a range of possible punctuation which is valid for the end of a sentence.
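To make the problem concrete, here is a rough Python sketch that runs the four example sentences through two splitters. Both functions and the tiny abbreviation list are assumptions made for illustration. The first applies the simplistic "a full-stop ends a sentence" rule; the second adds two hand-written exceptions, which works for this text but hints at how quickly such rules would multiply.

```python
import re

text = (
    "The bread costs £1.20 and the milk costs £1.50. "
    "Harold likes fast cars, e.g. Porsches and Ferraris. "
    "I'm going to the place where I work best, i.e. the coffee shop. "
    "I'm really excited about this sentence!"
)


def naive_split(text):
    """Treat every '.' as the end of a sentence."""
    return [piece.strip() for piece in text.split(".") if piece.strip()]


# A deliberately tiny, hypothetical list of abbreviations.
ABBREVIATIONS = {"e.g.", "i.e."}


def rule_split(text):
    """Split on . ! ? only when followed by a capitalised word or the end
    of the text, and never straight after a known abbreviation."""
    sentences, start = [], 0
    for match in re.finditer(r"[.!?](?=\s+[A-Z]|\s*$)", text):
        end = match.end()
        last_word = text[start:end].split()[-1]
        if last_word.lower() in ABBREVIATIONS:
            continue
        sentences.append(text[start:end].strip())
        start = end
    return sentences


print(len(naive_split(text)))  # 10 fragments instead of 4 sentences
print(len(rule_split(text)))   # 4
```

The naive version cuts prices and abbreviations in half, and it would never notice the exclamation mark if more text followed it.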
Why is it important that we know where sentences begin and end? Well, extracting the clinical terms from a body of text is relatively straightforward (more on that later), but then we need to combine them into clinical statements. If you join terms across sentence boundaries then you are likely to be joining terms which aren't relevant to each other, which could give spurious results. Equally, if you are overzealous in finding sentence boundaries, then you may fail to join terms which should be connected.
Unfortunately it's nigh on impossible to know exactly where a sentence ends with 100% accuracy. The English language has so much diversity, and breaks so many of its own rules, that building a 'decision engine' to check the various permutations would be incredibly complex. Also, the use of less structured notes by clinicians, often in a minimal grammatical form, makes automated comprehension doubly difficult.
So how do we do it? Humans have been trained to spot patterns in sentences, which over time our brains refine. But fundamentally your brain builds a statistical model of the English language. What this means is that even humans aren't 100% accurate. If your brain were a little bit more verbose it might say "I am 95% sure that's the end of a sentence", at which point another part of the brain would say "That's good enough for me!".
That's how we should approach the task of understanding the sentence in a computer system. However to understand an English sentence, we also need to understand English. Wow, so we now need to teach a computer the intricacies of the English language. It's not getting any easier!
NLP
Luckily the field of NLP comes to the rescue. No, not the widely-discredited neuro-linguistic programming, but the field of natural-language processing.
Natural-language processing is a field of computer science and linguistics concerned with the interactions between computers and human (natural) languages. It is known as an AI-complete problem, because natural-language recognition requires extensive knowledge of the outside world. The definition of "understanding" is one of the major problems in natural-language processing.
So to get around the fact that we can't create a truly artificially intelligent machine which "understands", we must get humans to provide some of the intelligence by analysing real world texts. This involves getting a copy of the text from a newspaper, or from clinical notes, and marking it with special tags: tags for the end of a sentence, for adjectives, nouns, conjunctions etc. The computer can then read these tags along with the text, and can internally build up a statistical model. The more text and tags that are provided, the more accurate it becomes.
Once the model has been trained sufficiently, it is possible to feed it un-tagged text. The model can then say "I'm 95% sure that's the end of a sentence". It can also understand grammatical concepts without ever having been given a set of grammatical rules. This is known as statistical natural-language processing.
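The idea of learning from tagged text can be illustrated with a deliberately tiny counting model in Python. Everything here is an assumption made for the sketch: the eight hand-tagged examples, the three features and the smoothing are invented, and a production system would train a far richer model on a large tagged corpus. The point is only that the probability comes from counts rather than from hand-written grammar rules.

```python
from collections import Counter

# Hypothetical hand-tagged examples:
# (word before the '.', is the next word capitalised?, is it a sentence end?)
TRAINING = [
    ("shop", True, True),
    ("pain", True, True),
    ("Ferraris", True, True),
    ("today", True, True),
    ("e.g", True, False),
    ("e.g", False, False),
    ("i.e", False, False),
    ("Dr", True, False),
]


def features(prev_word, next_capitalised):
    """Reduce a candidate boundary to a few crude yes/no features."""
    return ("." in prev_word, len(prev_word) <= 2, next_capitalised)


def train(examples):
    """Count how often each feature combination was tagged as a boundary."""
    counts = Counter()
    for prev_word, next_capitalised, is_end in examples:
        counts[(features(prev_word, next_capitalised), is_end)] += 1
    return counts


def p_sentence_end(counts, prev_word, next_capitalised):
    """Estimate P(sentence end) with add-one smoothing for unseen cases."""
    f = features(prev_word, next_capitalised)
    yes = counts[(f, True)] + 1
    no = counts[(f, False)] + 1
    return yes / (yes + no)


model = train(TRAINING)
print(round(p_sentence_end(model, "clinic", True), 2))  # 0.83
print(round(p_sentence_end(model, "e.g", True), 2))     # 0.33
```

The word 'clinic' never appears in the training data, yet the model is still fairly sure it ends a sentence, because it resembles the tagged examples that did.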
So once we understand what a sentence is, can we use the age-old technique of keyword matching to find clinical concepts? Not quite. Even in this area there are potential pitfalls. Take a look at the following:
Myocardial Infarction
Heart attack
MI
All three terms mean the same thing. But there is only one SNOMED CT term: Myocardial Infarction (22298006).
Now look at this:
Neck pain
Pain in the neck
Both represent the same concept, but the order of the words means the phrase 'looks' different to the computer system.
Next:
Inflamed
Inflammatory
Inflammation
Three variants of the same word. Keyword matching isn't a fuzzy comparison; it's an exact comparison. So if you've typed "Inflammation of the tonsils", it won't match "Inflamed tonsils", which is SNOMED CT term 281795003.
The key to this is to normalise the phrase before comparison, but also to normalise all terms stored in the clinical-terms database. The normalisation process includes stemming, tokenising the text and spelling variation generation ('pneumonia' and 'pnuemonia'). Through this process you can step through a sentence marking up the clinical terms.
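A minimal Python sketch of this normalisation idea follows. It is not the Med+DBase implementation: the stop-word list, the crude suffix-stripping stemmer and the small description table are all assumptions made for illustration, and spelling-variation generation is left out. Both the stored terms and the typed phrase pass through the same normalise function, so word order, filler words and word endings stop mattering.

```python
import re

STOP_WORDS = {"of", "the", "in", "a", "an"}  # hypothetical, very short list
SUFFIXES = ("ation", "atory", "ed", "s")     # crude stand-in for a real stemmer


def stem(word):
    """Strip one known suffix, then collapse a doubled final letter."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    if len(word) > 3 and word[-1] == word[-2]:
        word = word[:-1]
    return word


def normalise(phrase):
    """Lower-case, tokenise, drop stop words, stem, and sort the tokens."""
    tokens = re.findall(r"[a-z]+", phrase.lower())
    return tuple(sorted(stem(t) for t in tokens if t not in STOP_WORDS))


# Descriptions and codes taken from the examples above; each description
# is normalised once, when the index is built.
DESCRIPTIONS = {
    "Myocardial infarction": 22298006,
    "Heart attack": 22298006,
    "MI": 22298006,
    "Inflamed tonsils": 281795003,
}
INDEX = {normalise(text): code for text, code in DESCRIPTIONS.items()}


def lookup(phrase):
    """Return the code for a phrase, or None if nothing matches."""
    return INDEX.get(normalise(phrase))


print(lookup("Inflammation of the tonsils"))                   # 281795003
print(lookup("heart attack"))                                  # 22298006
print(normalise("Neck pain") == normalise("Pain in the neck")) # True
```

Sorting the tokens is what makes 'Neck pain' and 'Pain in the neck' identical; it trades away word order, which a fuller system might want to keep for longer phrases.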
Values are important too, such as dates and measurements. It's important to extract the value and the unit of measurement, and then associate them with the correct term. This can be achieved with a set of regular expressions and the 'units' sub-section of SNOMED CT: 258666001. Relevance to a term can be approximated by the distance, in words, from the term phrase.
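As a rough illustration of value extraction, the Python sketch below finds a number followed by a unit and attaches it to the nearest known term, measured in words. The unit list, the single-word terms and the example sentence are assumptions chosen to keep the example short; a real system would draw its units from SNOMED CT and would already have marked up multi-word clinical terms.

```python
import re

UNITS = {"kg", "cm", "mg", "m"}  # hypothetical stand-in for the SNOMED CT units
NUMBER = re.compile(r"^\d+(?:\.\d+)?$")


def extract_values(sentence, terms, units=UNITS):
    """Return (term, value, unit) triples, pairing each value with the
    closest term by word distance."""
    tokens = re.findall(r"\d+(?:\.\d+)?|[A-Za-z]+", sentence)
    term_positions = [
        (i, token.lower())
        for i, token in enumerate(tokens)
        if token.lower() in terms
    ]
    results = []
    for i in range(len(tokens) - 1):
        if NUMBER.match(tokens[i]) and tokens[i + 1] in units:
            nearest = None
            if term_positions:
                # Pick the term with the smallest distance in words.
                nearest = min(term_positions, key=lambda p: abs(p[0] - i))[1]
            results.append((nearest, float(tokens[i]), tokens[i + 1]))
    return results


found = extract_values(
    "Weight 82 kg and height 1.78 m recorded today.",
    terms={"weight", "height"},
)
print(found)  # [('weight', 82.0, 'kg'), ('height', 1.78, 'm')]
```

Word distance is only an approximation: in this sentence it pairs each value correctly, but a value sitting midway between two terms would need a tie-breaking rule.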
Negation
Another complexity to the process of extracting clinical statements is one of negation. For example:
"The patient is pregnant but has no back pain"
Clearly you wouldn't want the medical record to record the fact the patient has back pain because of a simple keyword match. You would want to record the fact they haven't got any back pain however, so the ability to negate terms is important. You might want to later search for negated terms, so leaving them off the medical record isn't wise.
Negation isn't as simple as looking for the word 'no' before a word however; there are many ways to negate. There are:
- Pre-concept negations, e.g. "absence of", "denied", "never had", "no sign of", etc.
- Post-concept negations, e.g. "unlikely", "was ruled out", etc.
- Pre-concept conditional possibility phrases, e.g. "rule him out", "must be ruled out"
- Post-concept conditional possibility phrases, e.g. "may be ruled out", "will be ruled out", etc.
Also the system should know when in a sentence the negation ends. For example:
"The patient denies back pain however they are reporting neck pain"
The first part of the sentence negates the term "back pain", but the second part asserts "neck pain". So the system must understand how conjunctions work, and which conjunctions end the scope of a negation. In this case the scope-ending conjunction is "however".
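Negation with scope can be sketched in Python along the following lines. The trigger lists are short samples based on the phrases above (with "denies" and "but" added as assumptions), the function name is hypothetical, and the approach of cutting the sentence into clauses at scope-ending conjunctions is one simple option rather than a description of any particular product. It also assumes the clinical terms have already been identified.

```python
import re

# Sample trigger phrases; real lists would be much longer.
PRE_NEGATIONS = ("absence of", "denied", "denies", "never had", "no sign of", "no")
POST_NEGATIONS = ("unlikely", "was ruled out")
SCOPE_TERMINATORS = ("however", "but")


def contains_any(text, phrases):
    """True if any phrase occurs in text as whole words."""
    return any(re.search(r"\b%s\b" % re.escape(p), text) for p in phrases)


def find_negated(sentence, terms):
    """Map each term found in the sentence to True (negated) or False."""
    text = sentence.lower()
    # A negation never crosses a scope-ending conjunction, so each clause
    # is examined on its own.
    clauses = re.split(r"\b(?:%s)\b" % "|".join(SCOPE_TERMINATORS), text)
    result = {}
    for clause in clauses:
        for term in terms:
            position = clause.find(term)
            if position == -1:
                continue
            before = clause[:position]
            after = clause[position + len(term):]
            result[term] = (
                contains_any(before, PRE_NEGATIONS)
                or contains_any(after, POST_NEGATIONS)
            )
    return result


print(find_negated(
    "The patient denies back pain however they are reporting neck pain",
    ["back pain", "neck pain"],
))  # {'back pain': True, 'neck pain': False}
```

Both findings are kept, each with its own flag, so that a later search can ask for patients who explicitly do not have back pain.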
Putting it all together
Once the components of a sentence are understood, including the clinical terms, negations, values with units, dates and conjunctions, it's possible to process them to create well defined clinical statements using the relationships which were mentioned earlier.
Med+DBase, which is our flagship practice management solution, has all of these features as standard, and allows for a high level of data integrity in patient medical records. It also means you can retrospectively code documents which up until now have remained as plain text, or have been naively coded by simple keyword-matching software. This gives much more coverage of the patient record.
The research team at Medical Management Systems is constantly looking into ways of improving the quality of patient medical records stored in our data centres and, just as importantly, of making the user experience totally seamless, so that clinical coding becomes an invisible process.