Could you use an LLM to label a material simply from its crystal structure expressed in natural language?
With the recent explosion of tools, I'm trying to find out. I came across Autolabel, a tool that uses LLMs to automatically label datasets with human-level accuracy in a fraction of the time. As for the technical details, it's not clear what the advantage of this library is beyond its use of few-shot prompt engineering and similarity search to get the best labeling accuracy out of an LLM.
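From the project README, the basic workflow looks roughly like this (a sketch only; `config_magnetic.json` and `materials.csv` are placeholder names, and the API may have changed since I tried it):

```python
# A minimal sketch of the Autolabel workflow, following the project README.
# "config_magnetic.json" and "materials.csv" are placeholder names.
from autolabel import LabelingAgent

agent = LabelingAgent(config="config_magnetic.json")
agent.plan("materials.csv")          # dry run: estimated cost and example prompts
labels = agent.run("materials.csv")  # calls the LLM and produces the labels
```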
So how do I think I might use this? Well, what if you could label a material as magnetic or non-magnetic simply from a description of its crystal structure? But how do you describe a crystal without using standard crystallography notation? In truth, the CIF format is just formatted text, so you could use that, but the standard is very broad and the format can contain sparse details. What we can do instead is use a tool called Robocrystallographer, which takes a structure file such as a CIF or POSCAR and describes the crystal structure and its local environments in significant detail. This type of description is ideal for something like labeling.
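For reference, generating a description like the ones below takes only a few lines with robocrys and pymatgen (the file name is a placeholder; I believe robocrys also ships a command-line interface):

```python
# A minimal sketch of generating a crystal description with Robocrystallographer.
# "SrSnP2O8.cif" is a placeholder for any CIF or POSCAR file pymatgen can read.
from pymatgen.core import Structure
from robocrys import StructureCondenser, StructureDescriber

structure = Structure.from_file("SrSnP2O8.cif")

# Condense the structure into a summary of sites, geometries, and connectivity,
# then render that summary as prose.
condenser = StructureCondenser()
describer = StructureDescriber()

condensed = condenser.condense_structure(structure)
print(describer.describe(condensed))
```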
The question now is whether the information is expressive enough for an LLM to understand the context of what makes something magnetic or non-magnetic. So here is an example of a few-shot prompt setup:
You are an expert in magnetic crystalline materials. Your job is to classify the magnetic ordering of a crystal into one of the following labels:
Non-magnetic
Magnetic
You will return the answer with just one element: "the correct label"
Some examples with their output answers are provided below:
Input: Cs(MoS)₃ crystallizes in the hexagonal P6₃/m space group. Cs is bonded in a 9-coordinate geometry to nine equivalent S atoms. There are three shorter (3.60 Å) and six longer (3.73 Å) Cs–S bond lengths. Mo is bonded in a distorted see-saw-like geometry to four equivalent S atoms. There are a spread of Mo–S bond distances ranging from 2.49–2.60 Å. S is bonded in a 7-coordinate geometry to three equivalent Cs and four equivalent Mo atoms.
Output: Non-magnetic
Input: CuCr₂Se₄ is Spinel structured and crystallizes in the cubic Fd̅3m space group. Cr³⁺ is bonded to six equivalent Se²⁻ atoms to form CrSe₆ octahedra that share corners with six equivalent CuSe₄ tetrahedra and edges with six equivalent CrSe₆ octahedra. All Cr–Se bond lengths are 2.52 Å. Cu²⁺ is bonded to four equivalent Se²⁻ atoms to form CuSe₄ tetrahedra that share corners with twelve equivalent CrSe₆ octahedra. The corner-sharing octahedral tilt angles are 57°. All Cu–Se bond lengths are 2.37 Å. Se²⁻ is bonded in a distorted rectangular see-saw-like geometry to three equivalent Cr³⁺ and one Cu²⁺ atom.
Output: Magnetic
Now I want you to label the following example:
Input: SrSn(PO₄)₂ crystallizes in the monoclinic C2/c space group. Sr²⁺ is bonded in a 8-coordinate geometry to eight O²⁻ atoms. There are a spread of Sr–O bond distances ranging from 2.61–2.99 Å. Sn⁴⁺ is bonded to six O²⁻ atoms to form SnO₆ octahedra that share corners with six equivalent PO₄ tetrahedra. There are a spread of Sn–O bond distances ranging from 2.03–2.10 Å. P⁵⁺ is bonded to four O²⁻ atoms to form PO₄ tetrahedra that share corners with three equivalent SnO₆ octahedra. The corner-sharing octahedral tilt angles range from 41–50°. There are a spread of P–O bond distances ranging from 1.52–1.58 Å. There are four inequivalent O²⁻ sites. In the first O²⁻ site, O²⁻ is bonded in a distorted bent 150 degrees geometry to one Sn⁴⁺ and one P⁵⁺ atom. In the second O²⁻ site, O²⁻ is bonded in a distorted single-bond geometry to two equivalent Sr²⁺ and one P⁵⁺ atom. In the third O²⁻ site, O²⁻ is bonded in a 3-coordinate geometry to one Sr²⁺, one Sn⁴⁺, and one P⁵⁺ atom. In the fourth O²⁻ site, O²⁻ is bonded in a 3-coordinate geometry to one Sr²⁺, one Sn⁴⁺, and one P⁵⁺ atom.
Output:
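In Autolabel terms, this prompt setup would be expressed as a classification config along these lines. This is a sketch from my reading of the Autolabel docs; the exact schema and fields like `compute_confidence` are assumptions that may differ between versions, and `seed_examples.csv` is a placeholder for a small CSV of labeled examples:

```python
# Hypothetical Autolabel classification config mirroring the prompt above.
# Field names follow my reading of the Autolabel docs and may differ by version.
config = {
    "task_name": "MagneticClassification",
    "task_type": "classification",
    "dataset": {"label_column": "label", "delimiter": ","},
    "model": {
        "provider": "openai",
        "name": "gpt-3.5-turbo",
        "compute_confidence": True,  # assumption: asks Autolabel to score labels
    },
    "prompt": {
        "task_guidelines": (
            "You are an expert in magnetic crystalline materials. Your job is "
            "to classify the magnetic ordering of a crystal into one of the "
            "following labels."
        ),
        "labels": ["Non-magnetic", "Magnetic"],
        "few_shot_examples": "seed_examples.csv",     # placeholder labeled CSV
        "few_shot_selection": "semantic_similarity",  # similarity-based selection
        "few_shot_num": 2,
        "example_template": "Input: {description}\nOutput: {label}",
    },
}
```

I believe the agent accepts either a dict like this or a path to its JSON equivalent; the `semantic_similarity` setting is what drives the similarity search mentioned earlier, picking the seed examples closest to each new input.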
Now if I use Autolabel, I can test how this might work for different OpenAI models, or even other LLMs. In addition, a confidence score (I'm not sure how this is calculated) is provided for each label, along with a threshold for deciding which labels to trust. So if a label comes back with 30% confidence but the threshold for reasonable confidence is around 55%, that label would be rejected.
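In code, accepting or rejecting labels against the threshold could look something like this (the file and column names are hypothetical placeholders; check what `agent.run()` actually writes out):

```python
# Filter generated labels by confidence. "materials_labeled.csv" and the
# column names are hypothetical; inspect Autolabel's real output to adapt.
import pandas as pd

df = pd.read_csv("materials_labeled.csv")
threshold = 0.55  # the "reasonable confidence" cutoff from the example above

accepted = df[df["llm_confidence"] >= threshold]
rejected = df[df["llm_confidence"] < threshold]
print(f"accepted {len(accepted)} of {len(df)} labels; rejected {len(rejected)}")
```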
First results
So I went ahead and tried this out, and if I can get it to work I'll probably write a preprint. However, my early attempts show that this isn't much better than a random guess, and in some cases it's even worse. I actually ran this on the full magnetic ordering labels (e.g., ferromagnetic, antiferromagnetic, ferrimagnetic, or non-magnetic) rather than on the binary magnetic vs. non-magnetic task.
Although the confidence level is high for all of the labels, the accuracy is abysmal: 17%. This is why the threshold value is needed as a guide for accepting labels; in this case, the threshold needs to be increased. I have seen that accuracy improves somewhat with more few-shot examples in the prompt. Using GPT-4 also seems to improve the accuracy considerably, but the cost is very high. It probably would have been better to just use the labels magnetic and non-magnetic, since magnetic ordering has its own symmetry related to electron spin, which won't be captured by the description generated by Robocrystallographer.
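One way to pick that threshold empirically is to sweep it over a held-out set of known labels and watch the accuracy/coverage trade-off (again using the hypothetical column names from the snippet above, plus an assumed `true_label` column of ground-truth labels):

```python
# Sweep confidence thresholds on a held-out labeled set to see the
# accuracy/coverage trade-off. File and column names are placeholders.
import pandas as pd

df = pd.read_csv("materials_labeled.csv")
for t in (0.5, 0.6, 0.7, 0.8, 0.9):
    kept = df[df["llm_confidence"] >= t]
    if kept.empty:
        print(f"threshold {t:.1f}: no labels kept")
        continue
    accuracy = (kept["llm_label"] == kept["true_label"]).mean()
    coverage = len(kept) / len(df)
    print(f"threshold {t:.1f}: coverage {coverage:.0%}, accuracy {accuracy:.0%}")
```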
As I do more on this I'll share my code, but at the moment I'm holding off since it needs further testing. I'm also hoping to publish this in my spare time if it shows significant promise for labeling crystal structures from natural-language descriptions. You can imagine other labels, such as whether or not a material is piezoelectric.
For me to be convinced this is the way to go, the accuracy on a test dataset needs to be in the 90% range. This is just an opinion, reflecting my bias toward wanting high-quality labeled data that I can then use for downstream work. I'll probably write an update to this post as I keep playing around with this.
Update, 26 June 2023: I've been trying this approach out and have been able to get better performance, especially when using GPT-4. The problem is that the cost of labeling a dataset with 1000 entries is high. I've been able to push the accuracy close to 70%, but in practice I would only want to keep labels with confidence scores above 90%, and many of the labels have very low confidence scores and are therefore probably just good guesses by the LLM.
References
[1] Autolabel. https://github.com/refuel-ai/autolabel. Accessed 21 June 2023.
[2] Ganose, Alex M., and Anubhav Jain. "Robocrystallographer: Automated Crystal Structure Text Descriptions and Analysis." MRS Communications, vol. 9, no. 3, Sept. 2019, pp. 874–81. https://doi.org/10.1557/mrc.2019.94.