Question posted 2011 · +11 upvotes
I have a simple problem that I’m hoping to resolve without using VBA but if that’s the only way it can be solved, so be it.
I have a file with multiple rows (all one column). Each row has data that looks something like this:
1 7.82E-13 >gi|297848936|ref|XP_00| 4-hydroxide gi|297338191|gb|23343|randomrandom
2 5.09E-09 >gi|168010496|ref|xp_00| 2-pyruvate
etc…
What I want is some way to extract the string of numbers that begin with “gi|” and end with a “|”. For some rows this might mean as many as 5 gi numbers, for others it’ll just be one.
What I would hope the output would look like would be something like:
297848936,297338191
168010496
etc…
Accepted answer +22 upvotes
Here is a very flexible VBA answer using the regex object. What the function does is extract every single sub-group match it finds (stuff inside the parenthesis), separated by whatever string you want (default is “, “). You can find info on regular expressions here: http://www.regular-expressions.info/
You would call it like this, assuming that first string is in A1:
=RegexExtract(A1,"gi[|](d+)[|]")
Since this looks for all occurance of “gi|” followed by a series of numbers and then another “|”, for the first line in your question, this would give you this result:
297848936, 297338191
Just run this down the column and you’re all done!
Function RegexExtract(ByVal text As String, _
ByVal extract_what As String, _
Optional separator As String = ", ") As String
Dim allMatches As Object
Dim RE As Object
Set RE = CreateObject("vbscript.regexp")
Dim i As Long, j As Long
Dim result As String
RE.pattern = extract_what
RE.Global = True
Set allMatches = RE.Execute(text)
For i = 0 To allMatches.count - 1
For j = 0 To allMatches.Item(i).submatches.count - 1
result = result & (separator & allMatches.Item(i).submatches.Item(j))
Next
Next
If Len(result) <> 0 Then
result = Right$(result, Len(result) - Len(separator))
End If
RegexExtract = result
End Function
3 code variants in this answer
- Variant 1 — 1 lines, starts with
=RegexExtract(A1,"gi[|](d+)[|]") - Variant 2 — 1 lines, starts with
297848936, 297338191 - Variant 3 — 27 lines, starts with
Function RegexExtract(ByVal text As String, _
Top excel-vba Q&A (6)
- How to clear the entire array? +58 (2010)
- How to change Format of a Cell to Text using VBA +55 (2011)
- Download attachment from Outlook and Open in Excel +43 (2012)
- Can a VBA function in Excel return a range? +36 (2009)
- 2 Dimensional array from range +34 (2013)
- Hiding an Excel worksheet with VBA +33 (2009)
excel-vba solutions on this site
— top 12%.